Journal of Artificial Intelligence Research 27 (2006) 119-151



Submitted 1/06; published 10/06 



A Comparison of Different Machine Transliteration Models 



Jong-Hoon Oh rovellia@nict.go.jp 
Computational Linguistics Group 

National Institute of Information and Communications Technology (NICT) 
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289 Japan 



Key-Sun Choi kschoi@cs.kaist.ac.kr 

Computer Science Division, Department of EECS 

Korea Advanced Institute of Science and Technology (KAIST) 

373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701 Republic of Korea 

Hitoshi Isahara isahara@nict.go.jp 
Computational Linguistics Group 

National Institute of Information and Communications Technology (NICT) 
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289 Japan 



Abstract 

Machine transliteration is a method for automatically converting words in one lan- 
guage into phonetically equivalent ones in another language. Machine transliteration plays 
an important role in natural language applications such as information retrieval and ma- 
chine translation, especially for handling proper nouns and technical terms. Four machine 
transliteration models - grapheme-based transliteration model, phoneme-based translitera- 
tion model, hybrid transliteration model, and correspondence-based transliteration model - 
have been proposed by several researchers. To date, however, there has been little research 
on a framework in which multiple transliteration models can operate simultaneously. Fur- 
thermore, there has been no comparison of the four models within the same framework and 
using the same data. We addressed these problems by 1) modeling the four models within 
the same framework, 2) comparing them under the same conditions, and 3) developing a 
way to improve machine transliteration through this comparison. Our comparison showed 
that the hybrid and correspondence-based models were the most effective and that the 
four models can be used in a complementary manner to improve machine transliteration 
performance. 



1. Introduction 

With the advent of new technology and the flood of information through the Web, it has 
become increasingly common to adopt foreign words into one's language. This usually en- 
tails adjusting the adopted word's original pronunciation to follow the phonological rules 
of the target language, along with modification of its orthographical form. This phonetic 
"translation" of foreign words is called transliteration. For example, the English word 
data is transliterated into Korean as 'de-i-teo' 1 and into Japanese as 'de-e-ta'. Translit- 
eration is particularly used to translate proper names and technical terms from languages 

1. In this paper, target language transliterations are represented in their Romanized form with single 
quotation marks and hyphens between syllables. 



©2006 AI Access Foundation. All rights reserved. 



Oh, Choi, & Isahara 



using Roman alphabets into ones using non-Roman alphabets, such as from English to
Korean, Japanese, or Chinese. Because transliteration is one of the main causes of the
out-of-vocabulary (OOV) problem, transliteration by means of dictionary lookup is
impractical (Fujii & Tetsuya, 2001; Lin & Chen, 2002). One way to solve the OOV problem is
to use machine transliteration. Machine transliteration is usually used to support machine
translation (MT) (Knight & Graehl, 1997; Al-Onaizan & Knight, 2002) and cross-language
information retrieval (CLIR) (Fujii & Tetsuya, 2001; Lin & Chen, 2002). For CLIR, machine
transliteration bridges the gap between the transliterated localized form and its original form
by generating all possible transliterations from the original form (or generating all possible
original forms from the transliteration) 2 . For example, machine transliteration can assist
query translation in CLIR, where proper names and technical terms frequently appear in
source language queries. In the area of MT, machine transliteration helps prevent
translation errors when translations of proper names and technical terms are not registered in
the translation dictionary. Machine transliteration can therefore improve the performance
of MT and CLIR.

Four machine transliteration models have been proposed by several researchers: the
grapheme 3 -based transliteration model (ψ_G) (Lee & Choi, 1998; Jeong, Myaeng, Lee, &
Choi, 1999; Kim, Lee, & Choi, 1999; Lee, 1999; Kang & Choi, 2000; Kang & Kim, 2000;
Kang, 2001; Goto, Kato, Uratani, & Ehara, 2003; Li, Zhang, & Su, 2004), the phoneme 4 -based
transliteration model (ψ_P) (Knight & Graehl, 1997; Lee, 1999; Jung, Hong, &
Paek, 2000; Meng, Lo, Chen, & Tang, 2001), the hybrid transliteration model (ψ_H) (Lee,
1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka, 2004), and the correspondence-based
transliteration model (ψ_C) (Oh & Choi, 2002). These models are classified in terms of
the units to be transliterated. ψ_G is sometimes referred to as the direct method because
it directly transforms source language graphemes into target language graphemes without
any phonetic knowledge of the source language words. ψ_P is sometimes referred to as
the pivot method because it uses source language phonemes as a pivot when it produces
target language graphemes from source language graphemes. ψ_P therefore usually
needs two steps: 1) produce source language phonemes from source language graphemes;
2) produce target language graphemes from source phonemes 5 . ψ_H and ψ_C make use
of both source language graphemes and source language phonemes when producing target
language transliterations. Hereafter, we refer to a source language grapheme as a source



2. The former process is generally called "transliteration", and the latter is generally called
"back-transliteration" (Knight & Graehl, 1997).

3. Graphemes refer to the basic units (or the smallest contrastive units) of a written language: for example, 
English has 26 graphemes or letters, Korean has 24, and German has 30. 

4. Phonemes are the simplest significant units of sound (or the smallest contrastive units of a spoken
language); for example, /M/, /AE/, and /TH/ in /M AE TH/, the pronunciation of math. We use the
ARPAbet symbols to represent source phonemes. ARPAbet is one of the methods used for coding source
phonemes into ASCII characters (http://www.cs.cmu.edu/~laura/pages/arpabet.ps). Here we denote
source phonemes and pronunciation with two slashes, as in /AH/, and use pronunciation based on The
CMU Pronunciation Dictionary and The American Heritage® Dictionary of the English Language.

5. These two steps are explicit if the transliteration system produces target language transliterations after 
producing the pronunciations of the source language words; they are implicit if the system uses phonemes 
implicitly in the transliteration stage and explicitly in the learning stage, as described elsewhere (Bilac 
& Tanaka, 2004).
grapheme, a source language phoneme as a source phoneme, and a target language grapheme
as a target grapheme. 

The transliterations produced by the four models usually differ because the models use
different information. Generally, transliteration is a phonetic process, as in ψ_P, rather
than an orthographic one, as in ψ_G (Knight & Graehl, 1997). However, standard
transliterations are not restricted to phoneme-based transliterations. For example, the standard
Korean transliterations of data, amylase, and neomycin are, respectively, the phoneme-based
transliteration 'de-i-teo', the grapheme-based transliteration 'a-mil-la-a-je', and
'ne-o-ma-i-sin', which is a combination of the grapheme-based transliteration 'ne-o' and the
phoneme-based transliteration 'ma-i-sin'. Furthermore, if the unit to be transliterated is
restricted to either a source grapheme or a source phoneme, it is hard to produce the correct
transliteration in many cases. For example, ψ_P cannot easily produce the grapheme-based
transliteration 'a-mil-la-a-je', the standard Korean transliteration of amylase, because ψ_P
tends to produce 'a-mil-le-i-seu' based on the sequence of source phonemes /AE M AH
L EY S/. Multiple transliteration models should therefore be applied to better cover the
various transliteration processes. To date, however, there has been little published research
regarding a framework in which multiple transliteration models can operate simultaneously.
Furthermore, there has been no reported comparison of the transliteration models within
the same framework and using the same data, although many English-to-Korean
transliteration methods based on ψ_G have been compared to each other with the same data (Kang
& Choi, 2000; Kang & Kim, 2000; Oh & Choi, 2002).

To address these problems, we 1) modeled a framework in which the four translit- 
eration models can operate simultaneously, 2) compared the transliteration 
models under the same conditions, and 3) using the results of the comparison, 
developed a way to improve the performance of machine transliteration. 

The rest of this paper is organized as follows. Section 2 describes previous work relevant 
to our study. Section 3 describes our implementation of the four transliteration models. 
Section 4 describes our testing and results. Section 5 describes a way to improve machine 
transliteration based on the results of our comparison. Section 6 describes a translitera- 
tion ranking method that can be used to improve transliteration performance. Section 7 
concludes the paper with a summary and a look at future work. 

2. Related Work 

Machine transliteration has received significant research attention in recent years. In most 
cases, the source language and target language have been English and an Asian language, re- 
spectively - for example, English to Japanese (Goto et al., 2003), English to Chinese (Meng 
et al., 2001; Li et al., 2004), and English to Korean (Lee & Choi, 1998; Kim et al., 1999; 
Jeong et al., 1999; Lee, 1999; Jung et al., 2000; Kang & Choi, 2000; Kang & Kim, 2000; 
Kang, 2001; Oh & Choi, 2002). In this section, we review previous work related to the four 
transliteration models. 

2.1 Grapheme-based Transliteration Model 

Conceptually, ψ_G is a direct orthographical mapping from source graphemes to target
graphemes. Several transliteration methods based on this model have been proposed, such
as those based on a source-channel model (Lee & Choi, 1998; Lee, 1999; Jeong et al.,
1999; Kim et al., 1999), a decision tree (Kang & Choi, 2000; Kang, 2001), a transliteration
network (Kang & Kim, 2000; Goto et al., 2003), and a joint source-channel model (Li et al.,
2004).

The methods based on the source-channel model deal with English-Korean transliteration.
They use a chunk of graphemes that can correspond to a source phoneme. First,
English words are segmented into chunks of English graphemes. Next, all possible chunks of
Korean graphemes corresponding to each chunk of English graphemes are produced. Finally,
the most relevant sequence of Korean graphemes is identified by using the source-channel
model. The advantage of this approach is that it considers a chunk of graphemes
representing a phonetic property of the source language word. However, errors in the first step
(segmenting the English words) propagate to the subsequent steps, making it difficult to
produce correct transliterations in those steps. Moreover, the time complexity is high
because all possible chunks of graphemes are generated in both languages.
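
To give a feel for the cost of generating all possible chunks, the following Python sketch enumerates every way to split a word into contiguous chunks. It is only an illustration: the function name `segmentations` is hypothetical, and the cited methods may restrict the chunk length, which this sketch does not. A word of n graphemes has 2^(n-1) such segmentations.

```python
from itertools import combinations


def segmentations(word):
    """Yield every way to split `word` into contiguous, non-empty chunks."""
    n = len(word)
    # Any subset of the n-1 gaps between letters can serve as chunk boundaries.
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    segs = list(segmentations("data"))
    print(len(segs))  # equals 2 ** (len("data") - 1)
    for seg in segs:
        print("-".join(seg))
```

Because the count doubles with each added grapheme, and target-side chunks multiply it further, exhaustive generation becomes expensive for long words.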

In the method based on a decision tree, decision trees that transform each source 
grapheme into target graphemes are learned and then directly applied to machine translit- 
eration. The advantage of this approach is that it considers a wide range of contextual 
information, say, the left three and right three contexts. However, it does not consider any 
phonetic aspects of transliteration. 

Kang and Kim (2000) and Goto et al. (2003) proposed methods based on a transliter- 
ation network for, respectively, English-to-Korean and English-to-Japanese transliteration. 
Their frameworks for constructing a transliteration network are similar - both are composed 
of nodes and arcs. A node represents a chunk of source graphemes and its corresponding 
target graphemes. An arc represents a possible link between nodes and has a weight showing 
its strength. Like the methods based on the source-channel model, their methods consider 
the phonetic aspect in the form of chunks of graphemes. Furthermore, they segment a chunk 
of graphemes and identify the most relevant sequence of target graphemes in one step. This 
means that errors are not propagated from one step to the next, as in the methods based 
on the source-channel model. 

The method based on the joint source-channel model simultaneously considers the source 
language and target language contexts (bigram and trigram) for machine transliteration. 
Its main advantage is the use of bilingual contexts. 

2.2 Phoneme-based Transliteration Model 

In ψ_P, the transliteration key is pronunciation or the source phoneme rather than
spelling or the source grapheme. This model is basically source grapheme-to-source phoneme
transformation followed by source phoneme-to-target grapheme transformation.

Knight and Graehl (1997) modeled Japanese-to-English transliteration with weighted 
finite state transducers (WFSTs) by combining several parameters including romaji-to- 
phoneme, phoneme-to-English, English word probabilities, and so on. A similar model was 
developed for Arabic-to-English transliteration (Stalls & Knight, 1998). Meng et al. (2001) 
proposed an English-to-Chinese transliteration method based on English grapheme-to-phoneme 
conversion, cross-lingual phonological rules, mapping rules between English phonemes and 
Chinese phonemes, and Chinese syllable-based and character-based language models. Jung 
et al. (2000) modeled English-to-Korean transliteration with an extended Markov window.
The method transforms an English word into English pronunciation by using a pronuncia- 
tion dictionary. Then it segments the English phonemes into chunks of English phonemes; 
each chunk corresponds to a Korean grapheme as defined by handcrafted rules. Finally, it 
automatically transforms each chunk of English phonemes into Korean graphemes by using 
an extended Markov window. 

Lee (1999) modeled English-to-Korean transliteration in two steps. The English
grapheme-to-English phoneme transformation is modeled in a manner similar to his method based
on the source-channel model described in Section 2.1. The English phonemes are then
transformed into Korean graphemes by using English-to-Korean standard conversion rules
(EKSCR) (Korea Ministry of Culture & Tourism, 1995). These rules are in the form of
context-sensitive rewrite rules, "P_A P_X P_B → y", meaning that English phoneme P_X is
rewritten as Korean grapheme y in the context P_A and P_B, where P_X, P_A, and P_B
represent English phonemes. For example, "P_A = *, P_X = /SH/, P_B = end → 'si'" means
"English phoneme /SH/ is rewritten into Korean grapheme 'si' if it occurs at the end of
the word (end) after any phoneme (*)". This approach suffers from both the propagation
of errors and the limitations of EKSCR. The first step, grapheme-to-phoneme transformation,
usually results in errors, and the errors propagate to the next step. Propagated errors
make it difficult for a transliteration system to work correctly. In addition, EKSCR does
not contain enough rules to generate correct Korean transliterations since its main focus is
mapping from an English phoneme to Korean graphemes without taking into account the
contexts of the English grapheme.
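
The rule format above can be illustrated with a small Python sketch. The tuple encoding, the `start` boundary marker, and the function names are assumptions made for this illustration; only the single /SH/ rule is taken from the text, and the actual EKSCR rule set is not reproduced here.

```python
# Hypothetical encoding of context-sensitive rewrite rules "P_A P_X P_B -> y".
# "*" matches any neighbour; "end" matches the end-of-word boundary.
RULES = [
    # (left context, phoneme, right context, Korean grapheme)
    ("*", "SH", "end", "si"),  # the example rule quoted in the text
]


def matches(pattern, symbol):
    """A pattern matches if it is the wildcard or equals the symbol."""
    return pattern == "*" or pattern == symbol


def rewrite(phonemes, rules=RULES):
    """Rewrite each phoneme with the first matching rule; None if no rule applies."""
    padded = ["start", *phonemes, "end"]  # "start" is an assumed boundary marker
    output = []
    for i in range(1, len(padded) - 1):
        left, current, right = padded[i - 1], padded[i], padded[i + 1]
        for p_a, p_x, p_b, y in rules:
            if p_x == current and matches(p_a, left) and matches(p_b, right):
                output.append(y)
                break
        else:
            output.append(None)  # this toy rule set has no rule for the phoneme
    return output


print(rewrite(["K", "AE", "SH"]))  # word-final /SH/ is rewritten as 'si'
print(rewrite(["SH", "IY"]))       # /SH/ is not word-final, so the rule does not fire
```

The sketch also shows the limitation noted above: a rule sees only phonemes, so two words with the same phoneme context but different spellings cannot be distinguished.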

2.3 Hybrid and Correspondence-based Transliteration Models 

Attempts to use both source graphemes and source phonemes in machine transliteration
led to the correspondence-based transliteration model (ψ_C) (Oh & Choi, 2002) and the
hybrid transliteration model (ψ_H) (Lee, 1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka,
2004). The former makes use of the correspondence between a source grapheme and a source
phoneme when it produces target language graphemes; the latter simply combines ψ_G and
ψ_P through linear interpolation. That is, ψ_H combines the grapheme-based
transliteration probability (Pr(ψ_G)) and the phoneme-based transliteration probability (Pr(ψ_P))
using linear interpolation.

Oh and Choi (2002) considered the contexts of a source grapheme and its corresponding
source phoneme for English-to-Korean transliteration. They used EKSCR as the basic
rules in their method. Additional contextual rules are semi-automatically constructed
by examining the cases in which EKSCR produced incorrect transliterations because of
a lack of contexts. These contextual rules are in the form of context-sensitive rewrite
rules, "C_A C_X C_B → y", meaning "C_X is rewritten as target grapheme y in the context
C_A and C_B". Note that C_X, C_A, and C_B represent the correspondence between the
English grapheme and phoneme. For example, we can read "C_A = (* : /Vowel/), C_X =
(r : /R/), C_B = (* : /Consonant/) → NULL" as "English grapheme r corresponding to
phoneme /R/ is rewritten into null Korean graphemes when it occurs after vowel phonemes,
(* : /Vowel/), before consonant phonemes, (* : /Consonant/)". The main advantage of
this approach is the application of a sophisticated rule that reflects the context of the source
grapheme and source phoneme by considering their correspondence. However, the approach
lacks portability to other languages because the rules are restricted to Korean.

Several researchers (Lee, 1999; Al-Onaizan & Knight, 2002; Bilac & Tanaka, 2004) have
proposed hybrid model-based transliteration methods. They model ψ_G and ψ_P with WFSTs
or a source-channel model and combine ψ_G and ψ_P through linear interpolation. In
their ψ_P, several parameters are considered, such as the source grapheme-to-source phoneme
probability, source phoneme-to-target grapheme probability, and target language word
probability. In their ψ_G, the source grapheme-to-target grapheme probability is mainly
considered. The main disadvantage of the hybrid model is that the dependence between the source
grapheme and source phoneme is not taken into consideration in the combining process; in
contrast, Oh and Choi's approach (Oh & Choi, 2002) considers this dependence by using
the correspondence between the source grapheme and phoneme.

3. Modeling Machine Transliteration Models 

In this section, we describe our implementation of the four machine transliteration models
(ψ_G, ψ_P, ψ_H, and ψ_C) using three machine learning algorithms: memory-based learning,
decision-tree learning, and the maximum entropy model.

3.1 Framework for Four Machine Transliteration Models 

Figure 1 summarizes the differences among the transliteration models and their component
functions. ψ_G directly transforms source graphemes (S) into target graphemes (T).
ψ_P and ψ_C transform source graphemes into source phonemes and then generate target
graphemes 6 . While ψ_P uses only the source phonemes, ψ_C uses the correspondence between
the source grapheme and the source phoneme when it generates target graphemes. We
describe their differences with two functions, φ_PT and φ_(SP)T. ψ_H is represented as the
linear interpolation of Pr(ψ_G) and Pr(ψ_P) by means of α (0 < α < 1). Here, Pr(ψ_P) is the
probability that ψ_P will produce target graphemes, while Pr(ψ_G) is the probability that ψ_G
will produce target graphemes. We can thus regard ψ_H as being composed of component
functions of ψ_G and ψ_P (φ_SP, φ_PT, and φ_ST). Here we use the maximum entropy model
as the machine learning algorithm for ψ_H because ψ_H requires Pr(ψ_P) and Pr(ψ_G), and
only the maximum entropy model among memory-based learning, decision-tree learning,
and the maximum entropy model can produce the probabilities.
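
As a minimal sketch of the interpolation used by the hybrid model, the Python below combines two candidate probabilities. It assumes that α weights the phoneme-based model and (1 - α) the grapheme-based model; the text states only that the two probabilities are linearly interpolated by α. The function names and all probability values are made up for illustration.

```python
def hybrid_score(pr_p, pr_g, alpha):
    """Linearly interpolate two model probabilities for one candidate.

    Assumption: alpha weights the phoneme-based probability and (1 - alpha)
    the grapheme-based probability.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return alpha * pr_p + (1.0 - alpha) * pr_g


def rank_candidates(pr_p_by_cand, pr_g_by_cand, alpha=0.5):
    """Rank candidates by interpolated score; a missing candidate counts as 0."""
    candidates = set(pr_p_by_cand) | set(pr_g_by_cand)
    scored = {
        c: hybrid_score(pr_p_by_cand.get(c, 0.0), pr_g_by_cand.get(c, 0.0), alpha)
        for c in candidates
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Made-up probabilities for two candidate transliterations of "data".
pr_p = {"de-i-teo": 0.6, "da-ta": 0.1}
pr_g = {"de-i-teo": 0.2, "da-ta": 0.5}
for candidate, score in rank_candidates(pr_p, pr_g, alpha=0.5):
    print(candidate, round(score, 3))
```

The sketch makes the requirement in the text concrete: both component models must output probabilities, which is why a probabilistic learner is needed for the hybrid model.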

To train each component function, we need to define the features that represent training
instances and data. Table 1 shows five feature types, f_S, f_P, f_Stype, f_Ptype, and f_T. The
feature types used depend on the component functions. The modeling of each component
function with the feature types is explained in Sections 3.2 and 3.3.

3.2 Component Functions of Each Transliteration Model 

Table 2 shows the definitions of the four component functions that we used. Each is defined 
in terms of its input and output: the first and last characters in the notation of each 
correspond respectively to its input and output. The role of each component function in 

6. According to (g∘f)(x) = g(f(x)), we can write (φ_(SP)T ∘ φ_SP)(x) = φ_(SP)T(φ_SP(x)) and
(φ_PT ∘ φ_SP)(x) = φ_PT(φ_SP(x)).
Figure 1: Graphical representation of each component function and four transliteration
models: S is a set of source graphemes (e.g., letters of the English alphabet), P is 
a set of source phonemes defined in ARPAbet, and T is a set of target graphemes. 



Feature type               Description and possible values
-------------------------  ---------------------------------------------------------
f_{S,Stype}   f_S          Source graphemes in S: 26 letters in English alphabet
              f_Stype      Source grapheme types: Consonant (C) and Vowel (V)
f_{P,Ptype}   f_P          Source phonemes in P (/AA/, /AE/, and so on)
              f_Ptype      Source phoneme types: Consonant (C), Vowel (V),
                           Semi-vowel (SV), and silence (/~/)
f_T                        Target graphemes in T

Table 1: Feature types used for transliteration models: f_{S,Stype} indicates both f_S and f_Stype,
while f_{P,Ptype} indicates both f_P and f_Ptype.



each transliteration model is to produce the most relevant output from its input. The 
performance of a transliteration model therefore depends strongly on that of its component 
functions. In other words, the better the modeling of each component function, the better 
the performance of the machine transliteration system. 

The modeling strongly depends on the feature type. Different feature types are used 
by the φ_(SP)T, φ_PT, and φ_ST functions, as shown in Table 2. These three component
functions thus have different strengths and weaknesses for machine transliteration. The
φ_ST function is good at producing grapheme-based transliterations and poor at producing
Notation    Feature types used               Input                       Output
----------  -------------------------------  --------------------------  ------
φ_SP        f_{S,Stype}, f_P                 s_i, c(s_i)                 p_i
φ_(SP)T     f_{S,Stype}, f_{P,Ptype}, f_T    s_i, p_i, c(s_i), c(p_i)    t_i
φ_PT        f_{P,Ptype}, f_T                 p_i, c(p_i)                 t_i
φ_ST        f_{S,Stype}, f_T                 s_i, c(s_i)                 t_i

Table 2: Definition of each component function: s_i, c(s_i), p_i, c(p_i), and t_i respectively
represent the i-th source grapheme, the context of s_i (s_{i-n}, …, s_{i-1} and s_{i+1}, …, s_{i+n}),
the i-th source phoneme, the context of p_i (p_{i-n}, …, p_{i-1} and p_{i+1}, …, p_{i+n}), and
the i-th target grapheme.



phoneme-based ones. In contrast, the φ_PT function is good at producing phoneme-based
transliterations and poor at producing grapheme-based ones. For amylase and its standard
Korean transliteration, 'a-mil-la-a-je', which is a grapheme-based transliteration, φ_ST tends
to produce the correct transliteration; φ_PT tends to produce wrong ones like
'ae-meol-le-i-seu', which is derived from /AE M AH L EY S/, the pronunciation of amylase. In contrast,
φ_PT can produce 'de-i-teo', which is the standard Korean transliteration of data and a
phoneme-based transliteration, while φ_ST tends to give a wrong one, like 'da-ta'.

The φ_(SP)T function combines the advantages of φ_ST and φ_PT by utilizing the
correspondence between the source grapheme and source phoneme. This correspondence
enables φ_(SP)T to produce both grapheme-based and phoneme-based transliterations.
Furthermore, the correspondence provides important clues for use in resolving transliteration
ambiguities 7 . For example, the source phoneme /AH/ produces much ambiguity in
machine transliteration because it can be mapped to almost every vowel in the source and
target languages (the underlined graphemes in the following example correspond to /AH/:
holocaust in English, 'hol-lo-ko-seu-teu' in its Korean counterpart, and 'ho-ro-ko-o-su-to' in
its Japanese counterpart). If we know the correspondence between the source grapheme and
source phoneme, we can more easily infer the correct transliteration of /AH/ because the
correct target grapheme corresponding to /AH/ usually depends on the source grapheme
corresponding to /AH/. Moreover, there are various Korean transliterations of the source
grapheme a: 'a', 'ae', 'ei', 'i', and 'o'. In this case, the English phonemes corresponding
to the English grapheme can help a component function resolve transliteration
ambiguities, as shown in Table 3. In Table 3, the a underlined in the example words shown in
the last column is pronounced as the English phoneme in the second column. By looking
at an English grapheme and its corresponding English phoneme, we can find correct Korean
transliterations more easily.
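
The disambiguating effect of the grapheme-phoneme correspondence can be sketched in Python as a lookup keyed on the pair rather than on the grapheme alone. The five pairs are those listed in Table 3; the dictionary name and function names are hypothetical, and a real component function would learn such mappings with context rather than store them in a table.

```python
# Korean renderings of English grapheme "a", keyed by its aligned phoneme
# (the five grapheme-phoneme pairs listed in Table 3).
A_BY_PHONEME = {
    ("a", "AA"): "a",
    ("a", "AE"): "ae",
    ("a", "EY"): "ei",
    ("a", "IH"): "i",
    ("a", "AO"): "o",
}


def candidates_grapheme_only(grapheme):
    """All target graphemes seen for a source grapheme when the phoneme is ignored."""
    return sorted({t for (g, _), t in A_BY_PHONEME.items() if g == grapheme})


def transliterate_pair(grapheme, phoneme):
    """Target grapheme for one grapheme-phoneme correspondence, or None if unseen."""
    return A_BY_PHONEME.get((grapheme, phoneme))


print(candidates_grapheme_only("a"))  # five competing candidates
print(transliterate_pair("a", "EY"))  # one candidate once the phoneme is known
```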

Though φ_(SP)T is more effective than both φ_ST and φ_PT in many cases, φ_(SP)T
sometimes works poorly when the standard transliteration is strongly biased to either
grapheme-based or phoneme-based transliteration. In such cases, either the source grapheme or source
phoneme does not contribute to the correct transliteration, making it difficult for φ_(SP)T
to produce the correct transliteration. Because φ_ST, φ_PT, and φ_(SP)T are the core parts

7. Though contextual information can also be used to reduce ambiguities, we limit our discussion here to 
the feature type. 
Korean Grapheme   English Phoneme   Example usage
---------------   ---------------   --------------------------------
'a'               /AA/              adagio, safari, vivace
'ae'              /AE/              advantage, alabaster, travertine
'ei'              /EY/              chamber, champagne, chaos
'i'               /IH/              advantage, average, silage
'o'               /AO/              allspice, ball, chalk

Table 3: Examples of Korean graphemes derived from English grapheme a and its corre- 
sponding English phonemes: the underlines in the example words indicate the 
English grapheme corresponding to English phonemes in the second column. 



of ψ_G, ψ_P, and ψ_C, respectively, the advantages and disadvantages of the three component
functions correspond to those of the transliteration models in which each is used.

Transliteration usually depends on context. For example, the English grapheme a can 
be transliterated into Korean graphemes on the basis of its context, like 'ei' in the context 
of -ation and 'a' in the context of art. When context information is used, determining 
the context window size is important. A context window that is too narrow can degrade 
transliteration performance because of a lack of context information. For example, when
the English grapheme t in -tion is transliterated into Korean, a single right-hand grapheme
is insufficient as context; all three right-hand graphemes, -ion, are needed to get the correct
Korean grapheme, 's'. A context window that is too wide can also degrade transliteration
performance because it reduces the power to resolve transliteration ambiguities. Many 
previous studies have determined that an appropriate context window size is 3. In this 
paper, we use a window size of 3, as in previous work (Kang & Choi, 2000; Goto et al., 
2003). The effect of the context window size on transliteration performance will be discussed 
in Section 4. 

Table 4 shows how to identify the most relevant output in each component function using
context information. L3-L1, C0, and R1-R3 represent the left context, current context
(i.e., that to be transliterated), and right context, respectively. The φ_SP function produces
the most relevant source phoneme for each source grapheme. If SW = s_1 · s_2 · … · s_n is
an English word, SW's pronunciation can be represented as a sequence of source phonemes
produced by φ_SP; that is, P_SW = p_1 · p_2 · … · p_n, where p_i = φ_SP(s_i, c(s_i)). φ_SP transforms
source graphemes into phonemes in two ways. The first one is to search in a pronunciation
dictionary containing English words and their pronunciation (CMU, 1997). The second one
is to estimate the pronunciation (or automatic grapheme-to-phoneme conversion)
(Andersen, Kuhn, Lazarides, Dalsgaard, Haas, & Noth, 1996; Daelemans & van den Bosch, 1996;
Pagel, Lenzo, & Black, 1998; Damper, Marchand, Adamson, & Gustafson, 1999; Chen,
2003). If an English word is not registered in the pronunciation dictionary, we must
estimate its pronunciation. The produced pronunciation is used for φ_PT in ψ_P and φ_(SP)T in
ψ_C. For training the automatic grapheme-to-phoneme conversion in φ_SP, we use The CMU
Pronouncing Dictionary (CMU, 1997).
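
The two routes to a pronunciation described above (dictionary lookup, then estimation for unlisted words) can be sketched in Python as follows. Everything here is a placeholder: `make_phi_sp`, the one-entry dictionary, and the deliberately naive estimator are illustrative assumptions and do not reproduce the resources or the learned grapheme-to-phoneme converter used in the paper.

```python
def make_phi_sp(pron_dict, estimate):
    """Build a grapheme-to-phoneme function: dictionary lookup first, estimator second.

    `pron_dict` maps a lowercase word to a list of ARPAbet phonemes; `estimate`
    is any callable that guesses a pronunciation for an unlisted word.
    """
    def phi_sp(word):
        key = word.lower()
        if key in pron_dict:
            return pron_dict[key], "dictionary"
        return estimate(key), "estimated"
    return phi_sp


def naive_estimate(word):
    """Toy estimator: one pseudo-phoneme per letter (not a real G2P model)."""
    return [ch.upper() for ch in word]


toy_dict = {"board": ["B", "AO", "R", "D"]}  # toy stand-in for a pronunciation dictionary
phi_sp = make_phi_sp(toy_dict, naive_estimate)

print(phi_sp("board"))  # found in the dictionary
print(phi_sp("bod"))    # falls back to the estimator
```

Returning the source of the pronunciation alongside the phonemes is a design choice of this sketch; it makes it easy to measure how often the error-prone estimation path is taken.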

The φ_ST, φ_PT, and φ_(SP)T functions produce target graphemes using their input. Like
φ_SP, these three functions use their previous outputs, which are represented by f_T. As
          Type      L3   L2   L1   C0    R1     R2    R3         Output
--------  --------  ---  ---  ---  ----  -----  ----  ----  ---  ------
φ_SP      f_S       $    $    $    b     o      a     r
          f_Stype   $    $    $    C     V      V     C     →    /B/
          f_P       $    $    $    ε
φ_ST      f_S       $    $    $    b     o      a     r
          f_Stype   $    $    $    C     V      V     C     →    'b'
          f_T       $    $    $    ε
φ_PT      f_P       $    $    $    /B/   /AO/   /~/   /R/

Table 4: Framework for each component function: $ represents start of words and ε means
unused contexts for each component function.



shown in Table 4, φ_ST, φ_PT, and φ_(SP)T produce target grapheme 'b' for source grapheme
b and source phoneme /B/ in board and /B AO R D/. Because the b and /B/ are the
first source grapheme of board and the first source phoneme of /B AO R D/, respectively,
their left context is $, which represents the start of words. Source graphemes (o, a, and r)
and their type (V: vowel, V: vowel, and C: consonant) can be the right context in φ_ST and
φ_(SP)T. Source phonemes (/AO/, /~/, and /R/) and their type (V: vowel, /~/: silence,
V: vowel) can be the right context in φ_PT and φ_(SP)T. Depending on the feature type
used in each component function and described in Table 2, φ_ST, φ_PT, and φ_(SP)T produce
a sequence of target graphemes, T_SW = t_1 · t_2 · … · t_n, for SW = s_1 · s_2 · … · s_n and
P_SW = p_1 · p_2 · … · p_n. For board, SW, P_SW, and T_SW can be represented as follows. The
/~/ represents silence (null source phonemes), and the '~' represents null target graphemes.

• SW = s_1 · s_2 · s_3 · s_4 · s_5 = b · o · a · r · d

• P_SW = p_1 · p_2 · p_3 · p_4 · p_5 = /B/ · /AO/ · /~/ · /R/ · /D/

• T_SW = t_1 · t_2 · t_3 · t_4 · t_5 = 'b' · 'o' · '~' · '~' · 'deu'
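
The context windows in the board example can be sketched in Python for the grapheme-only component function. This is an illustration under stated assumptions: the helper names are hypothetical, the consonant/vowel typing treats only a, e, i, o, u as vowels, and positions past the end of the word are also padded with $, although the text defines $ only as the start-of-word marker.

```python
VOWELS = set("aeiou")  # simplifying assumption: every other letter is a consonant


def grapheme_type(g):
    """Return 'V' for a vowel letter and 'C' otherwise."""
    return "V" if g in VOWELS else "C"


def window(seq, i, k=3, pad="$"):
    """Return seq[i-k .. i+k], padding positions outside the sequence with `pad`."""
    return [seq[j] if 0 <= j < len(seq) else pad for j in range(i - k, i + k + 1)]


def st_features(word, i, prev_targets, k=3):
    """Features for transliterating word[i] from graphemes and earlier outputs only."""
    types = [grapheme_type(g) for g in word]
    names = [f"L{d}" for d in range(k, 0, -1)] + ["C0"] + [f"R{d}" for d in range(1, k + 1)]
    f_s = dict(zip(names, window(list(word), i, k)))
    f_stype = dict(zip(names, window(types, i, k)))
    # Only target graphemes produced so far (the left context) are available.
    f_t = dict(zip(names[:k], window(prev_targets, i, k)[:k]))
    return {"f_S": f_s, "f_Stype": f_stype, "f_T": f_t}


features = st_features("board", 0, prev_targets=[])
print(features["f_S"])      # L3..L1 are '$'; C0 is 'b'; R1..R3 are 'o', 'a', 'r'
print(features["f_Stype"])  # same layout with consonant/vowel types
print(features["f_T"])      # no previous outputs yet, so all '$'
```

A phoneme-based or correspondence-based function would add analogous windows over the source phonemes and their types.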

3.3 Machine Learning Algorithms for Each Component Function 

In this section we describe a way to model component functions using three machine
learning algorithms (the maximum entropy model, decision-tree learning, and memory-based
learning) 8 . Because the four component functions share a similar framework, we limit our
focus to φ_(SP)T in this section.

8. These three algorithms are typically applied to automatic grapheme-to-phoneme conversion (Andersen 
et al., 1996; Daelemans & van den Bosch, 1996; Pagel et al., 1998; Damper et al., 1999; Chen, 2003). 
3.3.1 Maximum entropy model

The maximum entropy model (MEM) is a widely used probability model that can
incorporate heterogeneous information effectively (Berger, Pietra, & Pietra, 1996). In the
MEM, an event (ev) is usually composed of a target event (te) and a history event (he);
say ev = <te, he>. Event ev is represented by a bundle of feature functions, fe_i(ev),
which represent the existence of certain characteristics in event ev. A feature function is
a binary-valued function. It is activated (fe_i(ev) = 1) when it meets its activating
condition; otherwise it is deactivated (fe_i(ev) = 0) (Berger et al., 1996). Let source language
word SW be composed of n graphemes. SW, P_SW, and T_SW can then be represented as
SW = s_1, …, s_n, P_SW = p_1, …, p_n, and T_SW = t_1, …, t_n, respectively. P_SW and T_SW
represent the pronunciation and target language word corresponding to SW, and p_i and t_i
represent the source phoneme and target grapheme corresponding to s_i. Function φ_(SP)T
based on the maximum entropy model can be represented as

$$\Pr(T_{SW} \mid SW, P_{SW}) = \Pr(t_1, \ldots, t_n \mid s_1, \ldots, s_n, p_1, \ldots, p_n) \tag{1}$$

With the assumption that $\phi_{(SP)T}$ depends on the context information in window size $k$, we
simplify Formula (1) to

$$\Pr(T_{SW} \mid SW, P_{SW}) \approx \prod_i \Pr(t_i \mid t_{i-k}, \ldots, t_{i-1}, p_{i-k}, \ldots, p_{i+k}, s_{i-k}, \ldots, s_{i+k}) \tag{2}$$

Because $t_1, \ldots, t_n$, $s_1, \ldots, s_n$, and $p_1, \ldots, p_n$ can be represented by $f_T$, $f_{S,Stype}$, and $f_{P,Ptype}$,
respectively, we can rewrite Formula (2) as

$$\Pr(T_{SW} \mid SW, P_{SW}) \approx \prod_i \Pr(t_i \mid f_{T(i-k,i-1)}, f_{P,Ptype(i-k,i+k)}, f_{S,Stype(i-k,i+k)}) \tag{3}$$

where $i$ is the index of the current source grapheme and source phoneme to be transliterated
and $f_{X(l,m)}$ represents the features of feature type $f_X$ located from position $l$ to position $m$.

An important factor in designing a model based on the maximum entropy model is 
to identify feature functions that effectively support certain decisions of the model. Our 
basic philosophy of feature function design for each component function is that the context 
information collocated with the unit of interest is important. We thus designed the feature 
function with collocated features in each feature type and between different feature types. 
Features used for $\phi_{(SP)T}$ are listed below. These features are used as activating conditions
or history events of feature functions.

• Feature type and features used for designing feature functions in $\phi_{(SP)T}$ ($k = 3$)

- All possible features in $f_{S,Stype(i-k,i+k)}$, $f_{P,Ptype(i-k,i+k)}$, and $f_{T(i-k,i-1)}$ (e.g., $f_{S_{i-1}}$,
$f_{P_{i-1}}$, and $f_{T_{i-1}}$)

- All possible feature combinations between features of the same feature type (e.g.,
$\{f_{S_{i-2}}, f_{S_{i-1}}, f_{S_{i+1}}\}$, $\{f_{P_{i-2}}, f_{P_i}, f_{P_{i+2}}\}$, and $\{f_{T_{i-2}}, f_{T_{i-1}}\}$)

- All possible feature combinations between features of different feature types (e.g.,
$\{f_{S_{i-1}}, f_{P_{i-1}}\}$, $\{f_{S_{i-1}}, f_{T_{i-2}}\}$, and $\{f_{Ptype_{i-2}}, f_{P_{i-3}}, f_{T_{i-2}}\}$)

* between $f_{S,Stype(i-k,i+k)}$ and $f_{P,Ptype(i-k,i+k)}$

* between $f_{S,Stype(i-k,i+k)}$ and $f_{T(i-k,i-1)}$

* between $f_{P,Ptype(i-k,i+k)}$ and $f_{T(i-k,i-1)}$

| $fe_j$ | $te$ ($t_i$) | $he$: $f_{T(i-k,i-1)}$ | $he$: $f_{S,Stype(i-k,i+k)}$ | $he$: $f_{P,Ptype(i-k,i+k)}$ |
|---|---|---|---|---|
| $fe_1$ | 'b' | - | $f_{S_i}$ = b | $f_{P_i}$ = /B/ |
| $fe_2$ | 'b' | - | $f_{S_{i-1}}$ = \$ | - |
| $fe_3$ | 'b' | $f_{T_{i-1}}$ = \$ | $f_{S_{i+1}}$ = o and $f_{Stype_{i+2}}$ = V | $f_{P_i}$ = /B/ |
| $fe_4$ | 'b' | - | - | $f_{P_{i+1}}$ = /AO/ |
| $fe_5$ | 'b' | $f_{T_{i-2}}$ = \$ | $f_{S_{i+3}}$ = r | $f_{Ptype_i}$ = C |

Table 5: Feature functions for $\phi_{(SP)T}$ derived from Table 4.


Generally, a conditional maximum entropy model that gives the conditional probability
$\Pr(y \mid x)$ is represented as Formula (4) (Berger et al., 1996).

$$\Pr(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i fe_i(x, y)\Big) \tag{4}$$

$$Z(x) = \sum_y \exp\Big(\sum_i \lambda_i fe_i(x, y)\Big)$$

In $\phi_{(SP)T}$, the target event ($te$) is target graphemes to be assigned, and the history event
($he$) can be represented as a tuple $\langle f_{T(i-k,i-1)}, f_{S,Stype(i-k,i+k)}, f_{P,Ptype(i-k,i+k)} \rangle$. Therefore,
we can rewrite Formula (3) as

$$\Pr(t_i \mid f_{T(i-k,i-1)}, f_{S,Stype(i-k,i+k)}, f_{P,Ptype(i-k,i+k)}) = \Pr(te \mid he) = \frac{1}{Z(he)} \exp\Big(\sum_i \lambda_i fe_i(he, te)\Big) \tag{5}$$

Table 5 shows example feature functions for $\phi_{(SP)T}$; Table 4 was used to derive the
functions. For example, $fe_1$ represents an event where $he$ (history event) is "$f_{S_i}$ is b and
$f_{P_i}$ is /B/" and $te$ (target event) is "$f_{T_i}$ is 'b'". To model each component function based
on the MEM, Zhang's maximum entropy modeling tool is used (Zhang, 2004).
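To make Formulas (4) and (5) concrete, the sketch below evaluates a conditional maximum entropy distribution over candidate target graphemes from binary feature functions. It is only an illustration under stated assumptions: it does not use or imitate Zhang's toolkit, the helper names (`make_feature`, `maxent_prob`), the history keys ("S0", "P0", "P+1"), the candidate set, and the weights are all hypothetical, and the weights are fixed by hand rather than estimated from data.

```python
import math
from typing import Callable, Dict, List

History = Dict[str, str]
FeatureFn = Callable[[History, str], int]


def make_feature(conditions: History, target: str) -> FeatureFn:
    """Binary feature: 1 if te equals `target` and every condition holds in he, else 0."""
    def fe(he: History, te: str) -> int:
        return int(te == target and all(he.get(n) == v for n, v in conditions.items()))
    return fe


def maxent_prob(he: History, targets: List[str],
                features: List[FeatureFn], weights: List[float]) -> Dict[str, float]:
    """Pr(te | he) = exp(sum_i lambda_i * fe_i(he, te)) / Z(he), as in Formula (5)."""
    scores = {
        te: math.exp(sum(w * fe(he, te) for fe, w in zip(features, weights)))
        for te in targets
    }
    z = sum(scores.values())  # normalization term Z(he)
    return {te: s / z for te, s in scores.items()}


# Two feature functions modeled on fe_1 and fe_4 of Table 5 (weights are made up).
features = [
    make_feature({"S0": "b", "P0": "/B/"}, "b"),
    make_feature({"P+1": "/AO/"}, "b"),
]
weights = [1.5, 0.5]

he = {"S0": "b", "P0": "/B/", "P+1": "/AO/"}
probs = maxent_prob(he, ["b", "p"], features, weights)
print(probs)
```

Both features fire for the candidate 'b' and none fires for the hypothetical competitor 'p', so 'b' receives the larger share of the probability mass.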

3.3.2 Decision-tree learning 

Decision-tree learning (DTL) is one of the most widely used and well-known methods for 
inductive inference (Quinlan, 1986; Mitchell, 1997). ID3, which is a greedy algorithm 
that constructs decision trees in a top-down manner, uses the information gain, which is a 
measure of how well a given feature (or attribute) separates training examples on the basis of 
their target class (Quinlan, 1993; Manning & Schütze, 1999). We use C4.5 (Quinlan, 1993),
which is a well-known tool for DTL and an extension of Quinlan's ID3 algorithm.
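The information gain criterion described above can be sketched in a few lines. This is a toy illustration rather than C4.5 itself (C4.5 additionally normalizes the gain by the split information, which is omitted here); the feature names `C0_fP` and `R1_fS` and the four training rows are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], labels: List[str], feature: str) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    groups: Dict[str, List[str]] = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature], []).append(y)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: the phoneme at C0 determines the target class, the grapheme at R1 does not.
rows = [
    {"C0_fP": "/B/", "R1_fS": "o"},
    {"C0_fP": "/B/", "R1_fS": "a"},
    {"C0_fP": "/AO/", "R1_fS": "o"},
    {"C0_fP": "/AO/", "R1_fS": "a"},
]
labels = ["b", "b", "o", "o"]

gain_c0 = information_gain(rows, labels, "C0_fP")
gain_r1 = information_gain(rows, labels, "R1_fS")
print(gain_c0, gain_r1)
```

A top-down learner would test `C0_fP` first in this toy set because it has the larger gain.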

The training data for each component function is represented by features located in L3-L1,
C0, and R1-R3, as shown in Table 4. C4.5 tries to construct a decision tree by looking
for regularities in the training data (Mitchell, 1997). Figure 2 shows part of the decision
tree constructed for $\phi_{(SP)T}$ in English-to-Korean transliteration. A set of the target classes
in the decision tree for $\phi_{(SP)T}$ is a set of the target graphemes. The rectangles indicate the
leaf nodes, which represent the target classes, and the circles indicate the decision nodes.
To simplify our examples, we use only $f_S$ and $f_P$. Note that all feature types for each
component function, as described in Table 4, are actually used to construct decision trees.
Intuitively, the most effective feature from among L3-L1, C0, and R1-R3 for $\phi_{(SP)T}$ may be
located in C0 because the correct outputs of $\phi_{(SP)T}$ strongly depend on the source grapheme
or source phoneme in the C0 position. As we expected, the most effective feature in the
decision tree is located in the C0 position, that is, C0($f_P$). (Note that the first feature
to be tested in decision trees is the most effective feature.) In Figure 2, the decision tree
produces the target grapheme (Korean grapheme) 'o' for the instance x(SPT) by retrieving
the decision nodes from C0($f_P$) = /AO/ to R1($f_P$) = /~/ represented by '*'.



[Figure: part of the decision tree, rooted at C0($f_P$), shown with the example instance x(SPT)
described by feature types $f_S$ and $f_P$ over positions L3-L1, C0, and R1-R3.]

Figure 2: Decision tree for $\phi_{(SP)T}$.



3.3.3 Memory-based learning 

Memory-based learning (MBL), also called "instance-based learning" and "case-based learning",
is an example-based learning method. It is based on a k-nearest neighborhood algorithm
(Aha, Kibler, & Albert, 1991; Aha, 1997; Cover & Hart, 1967; Devijver & Kittler,
1982). MBL represents training data as a vector and, in the training phase, it places all
training data as examples in memory and clusters some examples on the basis of the
k-nearest neighborhood principle. Training data for MBL is represented in the same form
as training data for a decision tree. Note that the target classes for $\phi_{(SP)T}$, which MBL
outputs, are target graphemes. Feature weighting to deal with features of differing importance
is also done in the training phase⁹. It then produces an output using similarity-based
reasoning between test data and the examples in memory. If the test data is $x$ and the
set of examples in memory is $Y$, the similarity between $x$ and $Y$ can be estimated using
distance function $\Delta(x, Y)$¹⁰. MBL selects an example $y_i$ or the cluster of examples that are
most similar to $x$ and then assigns the example's target class to $x$'s target class. We use
an MBL tool called TiMBL (Tilburg memory-based learner) version 5.0 (Daelemans et al.,
2004).

9. TiMBL (Daelemans, Zavrel, Sloot, & Bosch, 2004) supports gain ratio weighting, information gain
weighting, chi-squared ($\chi^2$) weighting, and shared variance weighting of the features.
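The similarity-based reasoning described above can be illustrated with a small nearest-neighbor classifier that uses a weighted overlap distance (the number of mismatching feature positions, each scaled by a feature weight). This is a simplified stand-in written for illustration; it is not TiMBL and does not reproduce its interface or its clustering of examples. The function names, the three-feature instances, and the uniform default weights are assumptions.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Example = Tuple[Sequence[str], str]  # (feature vector, target class)


def overlap_distance(x: Sequence[str], y: Sequence[str],
                     weights: Optional[Sequence[float]] = None) -> float:
    """Sum of the weights of the positions where x and y disagree."""
    weights = weights or [1.0] * len(x)
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


def classify(x: Sequence[str], memory: List[Example], k: int = 1,
             weights: Optional[Sequence[float]] = None) -> str:
    """Return the majority class among the k stored examples closest to x."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(x, ex[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory of (L1, C0, R1) grapheme windows taken from 'board'.
memory = [
    (("$", "b", "o"), "b"),
    (("b", "o", "a"), "o"),
    (("o", "a", "r"), "~"),
]

prediction = classify(("$", "b", "a"), memory)
print(prediction)
```

The query differs from the first stored example in one position only, so that example's target class is returned.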

4. Experiments 

We tested the four machine transliteration models on English-to-Korean and English-to-Japanese
transliteration. The test set for the former (EKSet) (Nam, 1997) consisted of
7,172 English-Korean pairs - the number of training items was about 6,000 and that of the
blind test items was about 1,000. EKSet contained no transliteration variations, meaning
that there was one transliteration for each English word. The test set for the latter (EJSet)
contained English-katakana pairs from EDICT (Breen, 2003) and consisted of 10,417 pairs
- the number of training items was about 9,000 and that of the blind test items was about
1,000. EJSet contained transliteration variations, like <micro, 'ma-i-ku-ro'>, and <micro,
'mi-ku-ro'>; the average number of Japanese transliterations for an English word was 1.15.
EKSet and EJSet covered proper names, technical terms, and general terms. We used
the CMU Pronouncing Dictionary (CMU, 1997) for training pronunciation estimation (or
automatic grapheme-to-phoneme conversion) in $\phi_{SP}$. The training for automatic
grapheme-to-phoneme conversion was done ignoring the lexical stress of vowels in the dictionary (CMU,
1997). The evaluation was done in terms of word accuracy (WA), the evaluation measure
used in previous work (Kang & Choi, 2000; Kang & Kim, 2000; Goto et al., 2003; Bilac &
Tanaka, 2004). Here, WA can be represented as Formula (6). A generated transliteration
for an English word was judged to be correct if it exactly matched a transliteration for that
word in the test data.

$$WA = \frac{\text{number of correct transliterations output by system}}{\text{number of transliterations in blind test data}} \tag{6}$$

In the evaluation, we used k-fold cross-validation (k = 7 for EKSet and k = 10 for EJSet). The
test set was divided into k subsets. Each was used in turn for testing while the remainder was
used for training. The average WA computed across all k trials was used as the evaluation
results presented in this section.
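The evaluation protocol (Formula (6) averaged over k folds) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names are hypothetical, the interleaved fold assignment is an assumption (the paper does not say how the subsets were drawn), and the 'macro' entry is a made-up example added next to the 'micro' pair quoted in the text.

```python
from typing import Dict, Iterator, List, Set, Tuple


def word_accuracy(outputs: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Fraction of test words whose output exactly matches an accepted transliteration."""
    correct = sum(1 for word, refs in gold.items() if outputs.get(word) in refs)
    return correct / len(gold)


def k_fold_splits(items: List, k: int) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs; each of the k subsets is used once for testing."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# A word with transliteration variations counts as correct if any variant is matched.
gold = {"micro": {"ma-i-ku-ro", "mi-ku-ro"}, "macro": {"ma-ku-ro"}}
outputs = {"micro": "mi-ku-ro", "macro": "me-ku-ro"}
wa = word_accuracy(outputs, gold)

splits = list(k_fold_splits(list(range(14)), k=7))
print(wa, len(splits))
```

In a full run, a model would be trained on each `train` part, scored with `word_accuracy` on the matching `test` part, and the k scores averaged.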
We conducted six tests. 

• Hybrid Model Test: Evaluation of hybrid transliteration model by changing value of $\alpha$
(the parameter of the hybrid transliteration model)

• Comparison Test I: Comparison among four machine transliteration models

• Comparison Test II: Comparison of four machine transliteration models to previously
proposed transliteration methods

• Dictionary Test: Evaluation of transliteration models on words registered and not
registered in pronunciation dictionary to determine effect of pronunciation dictionary
on each model

• Context Window-Size Test: Evaluation of transliteration models for various sizes of
context window

• Training Data-Size Test: Evaluation of transliteration models for various sizes of
training data sets

10. Modified value difference metric, overlap metric, Jeffrey divergence metric, dot product metric, etc. are
used as the distance function (Daelemans et al., 2004).

4.1 Hybrid Model Test 

The objective of this test was to estimate the dependence of the performance of $\psi_H$ on
parameter $\alpha$. We evaluated the performance by changing $\alpha$ from 0 to 1 at intervals of
0.1 (i.e., $\alpha$ = 0, 0.1, 0.2, ..., 0.9, 1.0). Note that the hybrid model can be represented as
"$\alpha \times \Pr(\psi_P) + (1 - \alpha) \times \Pr(\psi_G)$". Therefore, $\psi_H$ is $\psi_G$ when $\alpha = 0$ and $\psi_P$ when $\alpha = 1$.
As shown in Table 6, the performance of $\psi_H$ depended on that of $\psi_G$ and $\psi_P$. For example,
the performance of $\psi_G$ exceeded that of $\psi_P$ for EKSet. Therefore, $\psi_H$ tended to perform
better when $\alpha < 0.5$ than when $\alpha > 0.5$ for EKSet. The best performance was attained
when $\alpha = 0.4$ for EKSet and when $\alpha = 0.5$ for EJSet. Hereinafter, we use $\alpha = 0.4$ for
EKSet and $\alpha = 0.5$ for EJSet as the linear interpolation parameter for $\psi_H$.
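The linear interpolation behind the hybrid model can be sketched as below. The sketch assumes, for illustration only, that each model assigns a probability to whole candidate transliterations; the function names and the probability values are hypothetical, and the two candidates are the outputs reported for geoid later in this section.

```python
from typing import Dict


def hybrid_scores(p_phoneme: Dict[str, float], p_grapheme: Dict[str, float],
                  alpha: float) -> Dict[str, float]:
    """Score each candidate as alpha * Pr_P(candidate) + (1 - alpha) * Pr_G(candidate)."""
    candidates = set(p_phoneme) | set(p_grapheme)
    return {
        c: alpha * p_phoneme.get(c, 0.0) + (1.0 - alpha) * p_grapheme.get(c, 0.0)
        for c in candidates
    }


def best(scores: Dict[str, float]) -> str:
    """Return the highest-scoring candidate."""
    return max(scores, key=scores.get)


# Made-up probabilities: the grapheme-based model prefers 'je-o-i-deu',
# the phoneme-based model prefers 'ji-o-i-deu'.
p_g = {"je-o-i-deu": 0.6, "ji-o-i-deu": 0.4}
p_p = {"je-o-i-deu": 0.3, "ji-o-i-deu": 0.7}

for alpha in (0.0, 0.4, 1.0):
    print(alpha, best(hybrid_scores(p_p, p_g, alpha)))
```

With alpha = 0 the choice follows the grapheme-based scores alone and with alpha = 1 the phoneme-based scores alone; intermediate values trade the two off.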



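| $\alpha$ | EKSet | EJSet |
|---|---|---|
| 0 | 58.8% | 58.8% |
| 0.1 | 61.2% | 60.9% |
| 0.2 | 62.0% | 62.6% |
| 0.3 | 63.0% | 64.1% |
| 0.4 | 64.1% | 65.4% |
| 0.5 | 63.4% | 65.8% |
| 0.6 | 61.1% | 65.0% |
| 0.7 | 59.6% | 63.4% |
| 0.8 | 58.2% | 62.1% |
| 0.9 | 57.0% | 61.2% |
| 1.0 | 55.2% | 59.2% |

Table 6: Results of Hybrid Model Test.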



4.2 Comparison Test I 

The objectives of the first comparison test were to compare performance among the four
transliteration models ($\psi_G$, $\psi_P$, $\psi_H$, and $\psi_C$) and to compare the performance of each model
with the combined performance of three of the models ($\psi_{G+P+C}$). Table 7 summarizes the
performance of each model for English-to-Korean and English-to-Japanese transliteration,
where DTL, MBL, and MEM represent decision-tree learning, memory-based learning,
and maximum entropy model.

The unit to be transliterated was restricted to either a source grapheme or a source
phoneme in $\psi_G$ and $\psi_P$; it was dynamically selected on the basis of the contexts in $\psi_H$
and $\psi_C$. This means that $\psi_G$ and $\psi_P$ could produce an incorrect result if either a source
phoneme or a source grapheme, which, respectively, they do not consider, holds the key to
producing the correct transliteration result. For this reason, $\psi_H$ and $\psi_C$ performed better
than both $\psi_G$ and $\psi_P$.



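| Transliteration Model | EKSet DTL | EKSet MBL | EKSet MEM | EJSet DTL | EJSet MBL | EJSet MEM |
|---|---|---|---|---|---|---|
| $\psi_G$ | 53.1% | 54.6% | 58.8% | 55.6% | 58.9% | 58.8% |
| $\psi_P$ | 50.8% | 50.6% | 55.2% | 55.8% | 56.1% | 59.2% |
| $\psi_H$ | N/A | N/A | 64.1% | N/A | N/A | 65.8% |
| $\psi_C$ | 59.5% | 60.3% | 65.5% | 64.0% | 65.8% | 69.1% |
| $\psi_{G+P+C}$ | 72.0% | 71.4% | 75.2% | 73.4% | 74.2% | 76.6% |

Table 7: Results of Comparison Test I.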



In the table, $\psi_{G+P+C}$ means the combined results for the three transliteration models,
$\psi_G$, $\psi_P$, and $\psi_C$. We exclude $\psi_H$ from the combining because it is implemented only
with the MEM (the performance of combining the four transliteration models is discussed
in Section 5). In evaluating $\psi_{G+P+C}$, we judged the transliteration results to be correct
if there was at least one correct transliteration among the results produced by the three
models. Though $\psi_C$ showed the best results among the three transliteration models due to
its ability to use the correspondence between the source grapheme and source phoneme, the
source grapheme or the source phoneme can create noise when the correct transliteration
is produced by the other one. In other words, when the correct transliteration is strongly
biased to either grapheme-based or phoneme-based transliteration, $\psi_G$ and $\psi_P$ may be more
suitable for producing the correct transliteration.
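The "at least one correct" criterion used for the combined result can be sketched as follows, using the five example words and model outputs listed in Table 8. The function name `combined_accuracy` and the dictionary layout are hypothetical, and the reference transliterations are simply the outputs that Table 8 marks as correct.

```python
from typing import Dict, Set


def combined_accuracy(model_outputs: Dict[str, Dict[str, str]],
                      gold: Dict[str, Set[str]]) -> float:
    """A word counts as correct if at least one model outputs an accepted transliteration."""
    correct = sum(
        1 for word, refs in gold.items()
        if any(out.get(word) in refs for out in model_outputs.values())
    )
    return correct / len(gold)


gold = {
    "cyclase": {"si-keul-la-a-je"},
    "bacteroid": {"bak-te-lo-i-deu"},
    "geoid": {"ji-o-i-deu"},
    "silo": {"sa-il-lo"},
    "saxhorn": {"saek-seu-hon"},
}
outputs = {
    "G": {"cyclase": "si-keul-la-a-je", "bacteroid": "bak-te-lo-i-deu",
          "geoid": "je-o-i-deu", "silo": "sil-lo", "saxhorn": "saek-seon"},
    "P": {"cyclase": "sa-i-keul-la-a-je", "bacteroid": "bak-teo-o-i-deu",
          "geoid": "ji-o-i-deu", "silo": "sa-il-lo", "saxhorn": "saek-seu-ho-leun"},
    "C": {"cyclase": "sa-i-keul-la-a-je", "bacteroid": "bak-te-lo-i-deu",
          "geoid": "ge-o-i-deu", "silo": "sil-lo", "saxhorn": "saek-seu-hon"},
}

single = {name: combined_accuracy({name: out}, gold) for name, out in outputs.items()}
combined = combined_accuracy(outputs, gold)
print(single, combined)
```

On these five words each model alone gets two right, while the combination covers all five, which is the complementary behavior the comparison measures.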

Table 8 shows example transliterations produced by each transliteration model. The
$\psi_G$ produced correct transliterations for cyclase and bacteroid, while $\psi_P$ did the same for
geoid and silo. $\psi_C$ produced correct transliterations for saxhorn and bacteroid, and $\psi_H$
produced correct transliterations for geoid and bacteroid. As shown by these results, there
are transliterations that only one transliteration model can produce correctly. For example,
only $\psi_G$, $\psi_P$, and $\psi_C$ produced the correct transliterations of cyclase, silo, and saxhorn,
respectively. Therefore, these three transliteration models can be used in a complementary
manner to improve transliteration performance because at least one can usually produce the
correct transliteration. This combination increased the performance compared to $\psi_G$,
$\psi_P$, and $\psi_C$ (on average, 30.1% in EKSet and 24.6% in EJSet). In short, $\psi_G$, $\psi_P$, and $\psi_C$ are
complementary transliteration models that together produce more correct transliterations,
so combining different transliteration models can improve transliteration performance. The
transliteration results produced by $\psi_{G+P+C}$ are analyzed in detail in Section 5.

11. We tested all possible combinations between $\Delta(x, Y)$ and a weighting scheme supported by
TiMBL (Daelemans et al., 2004) and did not detect any significant differences in performance for the
various combinations. Therefore, we used the default setting of TiMBL (Overlap metric for $\Delta(x, Y)$ and
gain ratio weighting for feature weighting).









| Source word | $\psi_G$ | $\psi_P$ | $\psi_H$ | $\psi_C$ |
|---|---|---|---|---|
| cyclase | 'si-keul-la-a-je' | *'sa-i-keul-la-a-je' | *'sa-i-keul-la-a-je' | *'sa-i-keul-la-a-je' |
| bacteroid | 'bak-te-lo-i-deu' | *'bak-teo-o-i-deu' | 'bak-te-lo-i-deu' | 'bak-te-lo-i-deu' |
| geoid | *'je-o-i-deu' | 'ji-o-i-deu' | 'ji-o-i-deu' | *'ge-o-i-deu' |
| silo | *'sil-lo' | 'sa-il-lo' | *'sil-lo' | *'sil-lo' |
| saxhorn | *'saek-seon' | *'saek-seu-ho-leun' | *'saek-seon' | 'saek-seu-hon' |

Table 8: Example transliterations produced by each transliteration model (* indicates an
incorrect transliteration).

In our subsequent testing, we used the maximum entropy model as the machine learning
algorithm for two reasons. First, it produced the best results of the three algorithms we
tested¹². Second, it can support $\psi_H$.

4.3 Comparison Test II 

In this test, we compared four previously proposed machine transliteration methods (Kang
& Choi, 2000; Kang & Kim, 2000; Goto et al., 2003; Bilac & Tanaka, 2004) to the four
transliteration models ($\psi_G$, $\psi_P$, $\psi_H$, and $\psi_C$), which were based on the MEM. Table 9 shows
the results. We trained and tested the previous methods with the same data sets used for
the four transliteration models. Table 10 shows the key features of the methods and models
from the viewpoint of information type and usage. Information type indicates the type of
information considered: source grapheme, source phoneme, and correspondence between
the two. For example, the first three methods use only the source grapheme. Information
usage indicates the context used and whether the previous output is used.

It is obvious from the table that the more information types a transliteration model
considers, the better its performance. Either the source phoneme or the correspondence -
which are not considered in the methods of Kang and Choi (2000), Kang and Kim (2000),
and Goto et al. (2003) - is the key to the higher performance of the method of Bilac and
Tanaka (2004) and the $\psi_H$ and $\psi_C$.

From the viewpoint of information usage, the models and methods that consider the
previous output tended to achieve better performance. For example, the method of Goto et
al. (2003) had better results than that of Kang and Choi (2000). Because machine transliteration
is sensitive to context, a reasonable context size usually enhances transliteration
ability. Note that the size of the context window for the previous methods was limited to 3
because a context window wider than 3 degrades performance (Kang & Choi, 2000) or does
not significantly improve it (Kang & Kim, 2000). Experimental results related to context
window size are given in Section 4.5.

12. A one-tail paired t-test showed that the results with the MEM were always significantly better (except
for $\psi_G$ in EJSet) than those of DTL and MBL (level of significance = 0.001).

| Method/Model | EKSet | EJSet |
|---|---|---|
| Previous methods | | |
| Kang and Choi (2000) | 51.4% | 50.3% |
| Kang and Kim (2000) | 55.1% | 53.2% |
| Goto et al. (2003) | 55.9% | 56.2% |
| Bilac and Tanaka (2004) | 58.3% | 62.5% |
| MEM-based models | | |
| $\psi_G$ | 58.8% | 58.8% |
| $\psi_P$ | 55.2% | 59.2% |
| $\psi_H$ | 64.1% | 65.8% |
| $\psi_C$ | 65.5% | 69.1% |

Table 9: Results of Comparison Test II.

| Method/Model | Info. type: S | Info. type: P | Info. type: C | Info. usage: Context | Info. usage: PO |
|---|---|---|---|---|---|
| Kang and Choi (2000) | + | | | < -3 ~ +3 > | |
| Kang and Kim (2000) | + | | | Unbounded | + |
| Goto et al. (2003) | + | | | < -3 ~ +3 > | + |
| Bilac and Tanaka (2004) | + | + | | Unbounded | |
| $\psi_G$ | + | | | < -3 ~ +3 > | + |
| $\psi_P$ | | + | | < -3 ~ +3 > | + |
| $\psi_H$ | + | + | | < -3 ~ +3 > | + |
| $\psi_C$ | + | + | + | < -3 ~ +3 > | + |

Table 10: Information type and usage for previous methods and four transliteration models,
where S, P, C, and PO respectively represent the source grapheme, source
phoneme, correspondence between S and P, and previous output.

Overall, $\psi_H$ and $\psi_C$ had better performance than the previous methods (on average,
17.04% better for EKSet and 21.78% better for EJSet), $\psi_G$ (on average, 9.6% better for
EKSet and 14.4% better for EJSet), and $\psi_P$ (on average, 16.7% better for EKSet and
19.0% better for EJSet). In short, a good machine transliteration model should 1) consider
either the correspondence between the source grapheme and the source phoneme or both
the source grapheme and the source phoneme, 2) have a reasonable context size, and 3)
consider previous output. The $\psi_H$ and $\psi_C$ satisfy all three conditions.

4.4 Dictionary Test

Table 11 shows the performance of each transliteration model for the dictionary test. In this
test, we evaluated the four transliteration models according to how the source pronunciation
was obtained (dictionary lookup or automatic grapheme-to-phoneme conversion). Registered
represents the performance for words registered in the pronunciation dictionary, and
Unregistered represents that for unregistered words. On average, the number of Registered
words in EKSet was about 600, and that in EJSet was about 700 in k-fold cross-validation
test data. In other words, Registered words accounted for about 60% of the test data in
EKSet and about 70% of the test data in EJSet. The correct pronunciation can always be
acquired from the pronunciation dictionary for Registered words, while the pronunciation
must be estimated for Unregistered words through automatic grapheme-to-phoneme conversion.
However, the automatic grapheme-to-phoneme conversion does not always produce correct
pronunciations - the estimated rate of correct pronunciations was about 70%.
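The Registered/Unregistered distinction amounts to a dictionary lookup with a fallback to estimated pronunciation. The sketch below illustrates that control flow only; the function name `pronounce`, the one-entry dictionary, and the letter-by-letter `naive_g2p` stand-in are hypothetical and far simpler than a trained grapheme-to-phoneme converter.

```python
from typing import Callable, Dict, List, Tuple


def pronounce(word: str, dictionary: Dict[str, List[str]],
              g2p: Callable[[str], List[str]]) -> Tuple[List[str], bool]:
    """Return (phonemes, registered): look the word up first, estimate only if missing."""
    if word in dictionary:
        return dictionary[word], True
    return g2p(word), False


def naive_g2p(word: str) -> List[str]:
    """Placeholder estimator: one pseudo-phoneme per letter (illustration only)."""
    return [ch.upper() for ch in word]


dictionary = {"board": ["B", "AO", "R", "D"]}

registered_result = pronounce("board", dictionary, naive_g2p)
unregistered_result = pronounce("silo", dictionary, naive_g2p)
print(registered_result, unregistered_result)
```

Keeping the `registered` flag alongside the phonemes makes it straightforward to report accuracy separately for the two groups, as Table 11 does.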





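| Model | EKSet Registered | EKSet Unregistered | EJSet Registered | EJSet Unregistered |
|---|---|---|---|---|
| $\psi_G$ | 60.91% | 55.74% | 61.18% | 50.24% |
| $\psi_P$ | 66.70% | 38.45% | 64.35% | 40.78% |
| $\psi_H$ | 70.34% | 53.31% | 70.20% | 50.02% |
| $\psi_C$ | 73.32% | 54.12% | 74.04% | 51.39% |
| ALL | 80.78% | 68.41% | 81.17% | 62.31% |

Table 11: Results of Dictionary Test: ALL means $\psi_{G+P+H+C}$.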



Analysis of the results showed that the four transliteration models fall into three categories.
Since $\psi_G$ is free from the need for correct pronunciation, that is, it does not use
the source phoneme, its performance is not affected by pronunciation correctness. Therefore,
$\psi_G$ can be regarded as the baseline performance for Registered and Unregistered. Because
$\psi_P$ ($\phi_{PT} \circ \phi_{SP}$), $\psi_H$ ($\alpha \times \Pr(\psi_P) + (1 - \alpha) \times \Pr(\psi_G)$), and $\psi_C$ ($\phi_{(SP)T} \circ \phi_{SP}$) depend on
the source phoneme, their performance tends to be affected by the performance of $\phi_{SP}$.
Therefore, $\psi_P$, $\psi_H$, and $\psi_C$ show notable differences in performance between Registered
and Unregistered. However, the performance gap differs with the strength of the dependence.
$\psi_P$ falls into the second category: its performance strongly depends on the correct
pronunciation. $\psi_P$ tends to perform well for Registered and poorly for Unregistered. $\psi_H$
and $\psi_C$ weakly depend on the correct pronunciation. Unlike $\psi_P$, they make use of both
the source grapheme and source phoneme. Therefore, they can perform reasonably well
without the correct pronunciation because using the source grapheme weakens the negative
effect of incorrect pronunciation in machine transliteration.

Comparing $\psi_C$ and $\psi_P$, we find two interesting things. First, $\psi_P$ was more sensitive to
errors in $\phi_{SP}$ for Unregistered. Second, $\psi_C$ showed better results for both Registered and
Unregistered. Because $\psi_P$ and $\psi_C$ share the same function, $\phi_{SP}$, the key factor accounting
for the performance gap between them is the component functions, $\phi_{PT}$ and $\phi_{(SP)T}$. From
the results shown in Table 11, we can infer that $\phi_{(SP)T}$ (in $\psi_C$) performed better than
$\phi_{PT}$ (in $\psi_P$) for both Registered and Unregistered. In $\phi_{(SP)T}$, the source grapheme
corresponding to the source phonemes, which $\phi_{PT}$ does not consider, made two contributions
to the higher performance of $\phi_{(SP)T}$. First, the source grapheme in the correspondence
made it possible to produce more accurate transliterations. Because $\phi_{(SP)T}$ considers the
correspondence, $\phi_{(SP)T}$ has a more powerful transliteration ability than $\phi_{PT}$, which uses
just the source phonemes, when the correspondence is needed to produce correct transliterations.
This is the main reason $\phi_{(SP)T}$ performed better than $\phi_{PT}$ for Registered. Second,
source graphemes in the correspondence compensated for errors produced by $\phi_{SP}$ in producing
target graphemes. This is the main reason $\phi_{(SP)T}$ performed better than $\phi_{PT}$ for
Unregistered. In the comparison between $\psi_C$ and $\psi_G$, the performances were similar for
Unregistered. This indicates that the transliteration power of $\psi_C$ is similar to that of $\psi_G$, even
though the pronunciation of the source language word may not be correct. Furthermore, the
performance of $\psi_C$ was significantly higher than that of $\psi_G$ for Registered. This indicates
that the transliteration power of $\psi_C$ is greater than that of $\psi_G$ if the correct pronunciation
is given.

The behavior of $\psi_H$ was similar to that of $\psi_C$. For Unregistered, $\Pr(\psi_G)$ in $\psi_H$ made
it possible for $\psi_H$ to avoid errors caused by $\Pr(\psi_P)$. Therefore, it worked better than $\psi_P$.
For Registered, $\Pr(\psi_P)$ enabled $\psi_H$ to perform better than $\psi_G$.

The results of this test showed that $\psi_H$ and $\psi_C$ perform better than $\psi_G$ and $\psi_P$ while
complementing $\psi_G$ and $\psi_P$ (and thus overcoming their disadvantage) by considering either
the correspondence between the source grapheme and the source phoneme or both the
source grapheme and the source phoneme.

4.5 Context Window-Size Test 

In our testing of the effect of the context window size, we varied the size from 1 to 5.
Regardless of the size, $\psi_H$ and $\psi_C$ always performed better than both $\psi_G$ and $\psi_P$. When
the size was 4 or 5, each model had difficulty identifying regularities in the training data.
Thus, there were consistent drops in performance for all models when the size was increased
from 3 to 4 or 5. Although the best performance was obtained when the size was 3, as shown
in Table 12, the differences in performance were not significant in the range of 2-4. However,
there was a significant difference between a size of 1 and a size of 2. This indicates that
a lack of contextual information can easily lead to incorrect transliteration. For example,
to produce the correct target language grapheme of t in -tion, we need the right three
graphemes (or at least the right two) of t, -ion (or -io). The results of this testing indicate
that the context size should be more than 1 to avoid degraded performance.

| Data set | Context Size | $\psi_G$ | $\psi_P$ | $\psi_H$ | $\psi_C$ | ALL |
|---|---|---|---|---|---|---|
| EKSet | 1 | 44.9% | 44.9% | 51.8% | 52.4% | 65.8% |
| EKSet | 2 | 57.3% | 52.8% | 61.7% | 64.4% | 74.4% |
| EKSet | 3 | 58.8% | 55.2% | 64.1% | 65.5% | 75.8% |
| EKSet | 4 | 56.1% | 54.6% | 61.8% | 64.3% | 74.4% |
| EKSet | 5 | 53.7% | 52.6% | 60.4% | 62.5% | 73.9% |
| EJSet | 1 | 46.4% | 52.1% | 58.0% | 62.0% | 70.4% |
| EJSet | 2 | 58.2% | 59.5% | 65.6% | 68.7% | 76.3% |
| EJSet | 3 | 58.8% | 59.2% | 65.8% | 69.1% | 77.0% |
| EJSet | 4 | 56.4% | 58.5% | 64.4% | 68.2% | 76.0% |
| EJSet | 5 | 53.9% | 56.4% | 62.9% | 66.3% | 75.5% |

Table 12: Results of Context Window-Size Test: ALL means $\psi_{G+P+H+C}$.

4.6 Training Data-Size Test

Table 13 shows the results of the Training Data-Size Test using MEM-based machine
transliteration models. We evaluated the performance of the four models and ALL while
varying the size of the training data from 20% to 100%. Obviously, the more training data
used, the higher the system performance. However, the objective of this test was to determine
whether the transliteration models perform reasonably well even for a small amount
of training data. We found that $\psi_G$ was the most sensitive of the four models to the amount
of training data; it had the largest difference in performance between 20% and 100%. In
contrast, ALL showed the smallest performance gap. The results of this test show that
combining different transliteration models is helpful in producing correct transliterations
even if there is little training data.



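| Data set | Training Data Size | $\psi_G$ | $\psi_P$ | $\psi_H$ | $\psi_C$ | ALL |
|---|---|---|---|---|---|---|
| EKSet | 20% | 46.6% | 47.3% | 53.4% | 57.0% | 67.5% |
| EKSet | 40% | 52.6% | 51.5% | 58.7% | 62.1% | 71.6% |
| EKSet | 60% | 55.2% | 53.0% | 61.5% | 63.3% | 73.0% |
| EKSet | 80% | 58.9% | 54.0% | 62.6% | 64.6% | 74.7% |
| EKSet | 100% | 58.8% | 55.2% | 64.1% | 65.5% | 75.8% |
| EJSet | 20% | 47.6% | 51.2% | 56.4% | 60.4% | 69.6% |
| EJSet | 40% | 52.4% | 55.1% | 60.7% | 64.8% | 72.6% |
| EJSet | 60% | 55.2% | 57.3% | 62.9% | 66.6% | 74.7% |
| EJSet | 80% | 57.9% | 58.8% | 65.4% | 68.0% | 76.7% |
| EJSet | 100% | 58.8% | 59.2% | 65.8% | 69.1% | 77.0% |

Table 13: Results of Training Data-Size Test: ALL means $\psi_{G+P+H+C}$.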



5. Discussion 

Figures 3 and 4 show the distribution of the correct transliterations produced by each
transliteration model and by the combination of models, all based on the MEM. The $\psi_G$,
$\psi_P$, $\psi_H$, and $\psi_C$ in the figures represent the set of correct transliterations produced by each
model through k-fold validation. For example, $|\psi_G|$ = 4,220 for EKSet and $|\psi_G|$ = 6,121
for EJSet mean that $\psi_G$ produced 4,220 correct transliterations for 7,172 English words
in EKSet ($|KTG|$ in Figure 3) and 6,121 correct ones for 10,417 English words in EJSet
($|JTG|$ in Figure 4). An important factor in modeling a transliteration model is to reflect the
dynamic transliteration behaviors, which means that a transliteration process dynamically
uses the source grapheme and source phoneme to produce transliterations. Due to these
dynamic behaviors, a transliteration can be grapheme-based transliteration, phoneme-based
transliteration, or some combination of the two. The forms of transliterations are classified
on the basis of the information upon which the transliteration process mainly relies (either
a source grapheme or a source phoneme or some combination of the two). Therefore, an
effective transliteration system should be able to produce various types of transliterations
at the same time. One way to accommodate the different dynamic transliteration behaviors
is to combine different transliteration models, each of which can handle a different behavior.
Synergy can be achieved by combining models so that one model can produce the correct
transliteration when the others cannot. Naturally, if the models tend to produce the same
transliteration, less synergy can be realized from combining them. Figures 3 and 4 show the
synergy gained from combining transliteration models in terms of the size of the intersection
and the union of the transliteration models.
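The union and intersection view of synergy can be sketched with ordinary set operations. The sets below are a toy illustration built from the five example words of Table 8 (the words each model transliterated correctly), not the full sets behind Figures 3 and 4; the function names `synergy` and `unique_contribution` are hypothetical.

```python
from typing import Dict, Set, Tuple


def synergy(correct_sets: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (size of the union, size of the intersection) of the models' correct sets."""
    union = set().union(*correct_sets.values())
    intersection = set.intersection(*correct_sets.values())
    return len(union), len(intersection)


def unique_contribution(correct_sets: Dict[str, Set[str]], name: str) -> int:
    """Number of words that only the named model transliterates correctly."""
    others = set().union(*(s for n, s in correct_sets.items() if n != name))
    return len(correct_sets[name] - others)


correct_sets = {
    "G": {"cyclase", "bacteroid"},
    "P": {"geoid", "silo"},
    "H": {"geoid", "bacteroid"},
    "C": {"saxhorn", "bacteroid"},
}

union_size, intersection_size = synergy(correct_sets)
contributions = {name: unique_contribution(correct_sets, name) for name in correct_sets}
print(union_size, intersection_size, contributions)
```

A small intersection together with a large union indicates that the models succeed on different words, which is where combining them pays off.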




[Figure panels: (a) $\psi_G+\psi_P+\psi_C$ (b) $\psi_G+\psi_P+\psi_H$ (c) $\psi_P+\psi_H+\psi_C$ (d) $\psi_G+\psi_H+\psi_C$]

Figure 3: Distributions of correct transliterations produced by models for English-to-Korean
transliteration. KTG represents "Korean Transliterations in the Gold
standard". Note that $|\psi_G \cup \psi_P \cup \psi_H \cup \psi_C|$ = 5,439, $|\psi_G \cap \psi_P \cap \psi_H \cap \psi_C|$ =
3,047, and $|KTG|$ = 7,172.

The figures show that, as the area of intersection between different transliteration models
becomes smaller, the size of their union tends to become bigger. The main characteristics
obtained from these figures are summarized in Table 14. The first thing to note is that
$|\psi_G \cap \psi_P|$ is clearly smaller than any other intersection. The main reason for this is that
$\psi_G$ and $\psi_P$ use no common information ($\psi_G$ uses source graphemes while $\psi_P$ uses source
phonemes). However, the others use at least one of source grapheme and source phoneme
(source graphemes are information common to $\psi_G$, $\psi_H$, and $\psi_C$ while source phonemes
are information common to $\psi_P$, $\psi_H$, and $\psi_C$). Therefore, we can infer that the synergy
derived from combining $\psi_G$ and $\psi_P$ is greater than that derived from the other combinations.

[Figure panels: (a) $\psi_G+\psi_P+\psi_C$ (b) $\psi_G+\psi_P+\psi_H$ (c) $\psi_P+\psi_H+\psi_C$ (d) $\psi_G+\psi_H+\psi_C$]

Figure 4: Distributions of correct transliterations produced by models for English-to-Japanese
transliteration. JTG represents "Japanese Transliterations in the Gold
standard". Note that $|\psi_G \cup \psi_P \cup \psi_H \cup \psi_C|$ = 8,021, $|\psi_G \cap \psi_P \cap \psi_H \cap \psi_C|$ = 4,786,
and $|JTG|$ = 10,417.





                 EKSet     EJSet
|ψ_G|            4,202     6,118
|ψ_P|            3,947     6,158
|ψ_H|            4,583     6,846
|ψ_C|            4,680     7,189
|ψ_G ∩ ψ_P|      3,133     4,937
|ψ_G ∩ ψ_C|      3,731     5,601
|ψ_G ∩ ψ_H|      4,025     5,731
|ψ_C ∩ ψ_H|      4,136     6,360
|ψ_P ∩ ψ_C|      3,675     5,759
|ψ_P ∩ ψ_H|      3,583     5,841
|ψ_G ∪ ψ_P|      5,051     7,345
|ψ_G ∪ ψ_C|      5,188     7,712
|ψ_G ∪ ψ_H|      4,796     7,239
|ψ_C ∪ ψ_H|      5,164     7,681
|ψ_P ∪ ψ_C|      4,988     7,594
|ψ_P ∪ ψ_H|      4,982     7,169

Table 14: Main characteristics obtained from Figures 3 and 4.
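The pairwise comparison drawn from Table 14 can be checked mechanically. The following is a minimal sketch, not part of the original experiments: it copies the pairwise intersection and union sizes from Table 14 into dictionaries and looks up the pair with the smallest intersection and the pair with the largest union. The names `INTERSECTION`, `UNION`, and `extreme_pair` are introduced here for illustration only.

```python
# Pairwise sizes copied from Table 14; each value is (EKSet, EJSet).
# Keys are unordered pairs of model letters (G, P, H, C).
INTERSECTION = {
    frozenset("GP"): (3133, 4937),
    frozenset("GC"): (3731, 5601),
    frozenset("GH"): (4025, 5731),
    frozenset("CH"): (4136, 6360),
    frozenset("PC"): (3675, 5759),
    frozenset("PH"): (3583, 5841),
}
UNION = {
    frozenset("GP"): (5051, 7345),
    frozenset("GC"): (5188, 7712),
    frozenset("GH"): (4796, 7239),
    frozenset("CH"): (5164, 7681),
    frozenset("PC"): (4988, 7594),
    frozenset("PH"): (4982, 7169),
}

EKSET, EJSET = 0, 1  # column indices into the value tuples


def extreme_pair(table, column, pick):
    """Return the model pair (letters sorted) selected by `pick` (min or max)."""
    pair = pick(table, key=lambda p: table[p][column])
    return "".join(sorted(pair))


for name, column in (("EKSet", EKSET), ("EJSet", EJSET)):
    print(name,
          "smallest intersection:", extreme_pair(INTERSECTION, column, min),
          "largest union:", extreme_pair(UNION, column, max))
```

For both test sets the smallest intersection belongs to the pair ψ_G and ψ_P, while the largest union belongs to a pair that includes ψ_C.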



However, the size of the union between the various pairs of transliteration models in Table 14
shows that |ψ_C ∪ ψ_H| and |ψ_G ∪ ψ_C| are bigger than |ψ_G ∪ ψ_P|. The main reason for this
might be the higher transliteration power of ψ_C and ψ_H compared to that of ψ_G and ψ_P:
ψ_C and ψ_H cover more of the KTG and JTG than both ψ_G and ψ_P. The second thing
to note is that the contribution of each transliteration model to |ψ_G ∪ ψ_P ∪ ψ_H ∪ ψ_C| can
be estimated from the difference between |ψ_G ∪ ψ_P ∪ ψ_H ∪ ψ_C| and the size of the union of
the three other transliteration models. For example, we can measure the contribution of ψ_H
from the difference between |ψ_G ∪ ψ_P ∪ ψ_H ∪ ψ_C| and |ψ_G ∪ ψ_P ∪ ψ_C|. As shown in
Figures 3(a) and 4(a), ψ_H makes the smallest contribution while ψ_C (Figures 3(b) and 4(b))
makes the largest contribution. The main reason for ψ_H making the smallest contribution is
that it tends to produce the same transliteration as the others, so the intersection between
ψ_H and the others tends to be large.
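The contribution estimate described above amounts to a set difference. The sketch below illustrates it with Python sets; the word identifiers in `toy` are hypothetical and chosen only to make the arithmetic visible, since the per-word outputs of the four models are not listed in the paper.

```python
def contributions(correct_by_model):
    """For each model, count the correct items that only this model supplies.

    The count is |union of all models| - |union of the other models|.
    """
    all_correct = set().union(*correct_by_model.values())
    result = {}
    for name in correct_by_model:
        others = set().union(
            *(items for model, items in correct_by_model.items() if model != name)
        )
        result[name] = len(all_correct) - len(others)
    return result


# Hypothetical toy data: source words each model transliterates correctly.
toy = {
    "G": {"w1", "w2", "w3"},
    "P": {"w2", "w4"},
    "H": {"w2", "w3"},
    "C": {"w2", "w3", "w5", "w6"},
}

print(contributions(toy))
```

In this toy setting the model whose correct outputs are all covered by the other models contributes nothing, which is the situation the text attributes to a model with large intersections.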

It is also important to rank the transliterations produced by a transliteration system for 
a source language word on the basis of their relevance. While a transliteration system can 
produce a list of transliterations, each reflecting a dynamic transliteration behavior, it will 
fail to perform well unless it can distinguish between correct and wrong transliterations. 
Therefore, a transliteration system should be able to produce various kinds of translitera- 
tions depending on the dynamic transliteration behaviors and be able to rank them on the 
basis of their relevance. In addition, the application of transliteration results to natural 
language applications such as machine translation and information retrieval requires that 
the transliterations be ranked and assigned a relevance score. 

In summary, 1) producing a list of transliterations reflecting dynamic translit- 
eration behaviors (one way is to combine the results of different transliteration models, 
each reflecting one of the dynamic transliteration behaviors) and 2) ranking the translit- 
erations in terms of their relevance are both necessary to improve the performance of 
machine transliteration. In the next section, we describe a way to calculate the relevance 
of transliterations produced by a combination of the four transliteration models. 

6. Transliteration Ranking 

The basic assumption of transliteration ranking is that correct transliterations are more
frequently used in real-world texts than incorrect ones. Web data reflecting the real-world
usage of transliterations can thus be used as a language resource to rank transliterations.
Transliterations that appear more frequently in web documents are given either a higher
rank or a higher score. The goal of transliteration ranking, therefore, is to rank correct
transliterations higher and rank incorrect ones lower. The transliterations produced for a
given English word by the four transliteration models (ψ_G, ψ_P, ψ_H, and ψ_C), based on the
MEM, were ranked using web data.

Our transliteration ranking relies on web frequency (number of web documents). To 
obtain reliable web frequencies, it is important to consider a transliteration and its cor- 
responding source language word together rather than the transliteration alone. This is 
because our aim is to find correct transliterations corresponding to a source language word 
rather than to find transliterations that are frequently used in the target language. There- 
fore, the best approach to transliteration ranking using web data is to find web documents 
in which transliterations are used as translations of the source language word. 

A bilingual phrasal search (BPS) queries a Web search engine with a phrase composed of a
transliteration and its source language word (e.g., {'a-mil-la-a-je' amylase}). The BPS
enables the Web search engine to find web documents that contain correct transliterations
corresponding to the source language word. Note that a phrasal query is represented in
braces, where the first part is a transliteration and the second part is the corresponding
source language word. Figure 5 shows Korean and Japanese web documents retrieved using a
BPS for amylase and its Korean/Japanese transliterations, 'a-mil-la-a-je' and
'a-mi-ra-a-je'. The web documents retrieved by a BPS usually contain a transliteration and
its corresponding source language word as a translation pair, with one of them often placed
in parentheses, as shown in Figure 5.

[Figure 5 panels: retrieved Korean web pages for the query {'a-mil-la-a-je' amylase} and
retrieved Japanese web pages for the query {'a-mi-ra-a-je' amylase}. Matched forms shown
include 'a-mil-la-a-je' amylase, 'a-mil-la-a-je' (amylase), and 'a-mil-la-a-je' [amylase],
and likewise for 'a-mi-ra-a-je'.]

Figure 5: Desirable retrieved web pages for transliteration ranking.

A dilemma arises, though, regarding the quality and coverage of retrieved web documents.
Though a BPS generally provides high-quality web documents that contain correct
transliterations corresponding to the source language word, the coverage is relatively low,
meaning that it may not find any web documents for some transliterations. For example,
a BPS for the Japanese phrasal query {'a-ru-ka-ro-si-su' alkalosis} and the Korean
phrasal query {'eo-min' ermine} found no web documents. Therefore, alternative search
methods are necessary when the BPS fails to find any relevant web documents. A bilingual
keyword search (BKS) (Qu & Grefenstette, 2004; Huang, Zhang, & Vogel, 2005; Zhang,
Huang, & Vogel, 2005) can be used when the BPS fails, and a monolingual keyword search
(MKS) (Grefenstette, Qu, & Evans, 2004) can be used when both the BPS and BKS fail.
Like a BPS, a BKS makes use of two keywords, a transliteration and its source language
word, as a search engine query. Whereas a BPS retrieves web documents containing the
two keywords as a phrase, a BKS retrieves web documents containing them anywhere in
the document. This means that the web frequencies of noisy transliterations are sometimes
higher than those of correct transliterations in a BKS, especially when the noisy
transliterations are one-syllable transliterations. For example, 'mok', which is a Korean
transliteration produced for mook and a one-syllable noisy transliteration, has a higher web
frequency than 'mu-keu', which is the correct transliteration for mook, because 'mok' is a
common Korean noun that frequently appears in Korean texts with the meaning of neck.
However, a BKS can improve coverage without a great loss of quality in the retrieved web
documents if the transliterations are composed of two or more syllables.

Though a BKS has higher coverage than a BPS, it can fail to retrieve web documents
in some cases. In such cases, an MKS (Grefenstette et al., 2004) is used. In an MKS,
a transliteration alone is used as the search engine query. A BPS and a BKS act like a
translation model, while an MKS acts like a language model. Though an MKS cannot give
information as to whether the transliteration is correct, it does provide information as to
whether the transliteration is likely to be a target language word. The three search methods
are used sequentially (BPS, BKS, MKS). If one method fails to retrieve any relevant web
documents, the next one is used. Table 15 summarizes the conditions for applying each
search method.
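The sequential use of the three search methods can be sketched as a simple fallback loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: the frequency functions are placeholders supplied by the caller (a real one would query one or more search engines and sum the document counts), and the counts in the stubs below are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A frequency function maps (source_word, transliteration) to a document count.
FreqFn = Callable[[str, str], int]


def select_search_method(
    source: str,
    candidates: List[str],
    methods: List[Tuple[str, FreqFn]],
) -> Tuple[Optional[str], Dict[str, int]]:
    """Try each (name, freq_fn) in order and return the first method whose
    total web frequency over all candidates is nonzero."""
    for name, freq_fn in methods:
        counts = {cand: freq_fn(source, cand) for cand in candidates}
        if sum(counts.values()) > 0:
            return name, counts
    # No method retrieved any document.
    return None, {cand: 0 for cand in candidates}


# Hypothetical stub frequency functions standing in for real searches.
def bps_stub(source: str, cand: str) -> int:
    return 0  # the phrasal search finds nothing in this toy case


def bks_stub(source: str, cand: str) -> int:
    return {"mu-keu": 12, "mok": 40}.get(cand, 0)


def mks_stub(source: str, cand: str) -> int:
    return {"mu-keu": 300, "mok": 9000}.get(cand, 0)


method, counts = select_search_method(
    "mook", ["mu-keu", "mok"],
    [("BPS", bps_stub), ("BKS", bks_stub), ("MKS", mks_stub)],
)
print(method, counts)
```

The ordering of the list passed as `methods` encodes the preference for higher-quality evidence; the lower-quality methods are consulted only when everything before them returns zero for every candidate.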

Along with these three search strategies, three different search engines are used to obtain
more web documents. The search engines used for this purpose should satisfy two conditions:
1) support Korean/Japanese web document retrieval and 2) support both phrasal
and keyword searches. Google¹³, Yahoo¹⁴, and MSN¹⁵ satisfy these conditions, and we used
them as our search engines.



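Search method    Condition

BPS              $\sum_{j} \sum_{c_k \in C} WF_{BPS_j}(e, c_k) \neq 0$

BKS              $\sum_{j} \sum_{c_k \in C} WF_{BPS_j}(e, c_k) = 0$ and
                 $\sum_{j} \sum_{c_k \in C} WF_{BKS_j}(e, c_k) \neq 0$

MKS              $\sum_{j} \sum_{c_k \in C} WF_{BPS_j}(e, c_k) = 0$,
                 $\sum_{j} \sum_{c_k \in C} WF_{BKS_j}(e, c_k) = 0$, and
                 $\sum_{j} \sum_{c_k \in C} WF_{MKS_j}(e, c_k) \neq 0$

Table 15: Conditions under which each search method is applied.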



$$RF(e, c_i) = \sum_{j} NWF_j(e, c_i) = \sum_{j} \frac{WF_j(e, c_i)}{\sum_{c_k \in C} WF_j(e, c_k)} \qquad (7)$$

Web frequencies acquired from these three search methods and these three search engines
were used to rank transliterations on the basis of Formula (7), where c_i is the i-th
transliteration produced by the four transliteration models, e is the source language word
of c_i, RF is a function for ranking transliterations, WF is a function for calculating web
frequency, NWF is a function for normalizing web frequency, C is a set of produced
transliterations, and j is an index for the j-th search engine. We used the normalized web
frequency as a ranking factor. The normalized web frequency is the web frequency divided by
the total web frequency of all produced transliterations corresponding to one source
language word. The score for a transliteration is then calculated by summing up the
normalized web frequencies of the transliteration given by the three search engines.
Table 16 shows an example ranking for the English word data and its possible Korean
transliterations, 'de-i-teo', 'de-i-ta', and 'de-ta', whose web frequencies were obtained
using a BPS.

13. http://www.google.com

14. http://www.yahoo.com

15. http://www.msn.com

The normalized WF_BPS (NWF_BPS) for search engine A was calculated as follows.

• NWF_BPS (data, 'de-i-teo') = 94,100 / (94,100 + 67,800 + 54) = 0.5811

• NWF_BPS (data, 'de-i-ta') = 67,800 / (94,100 + 67,800 + 54) = 0.4186

• NWF_BPS (data, 'de-ta') = 54 / (94,100 + 67,800 + 54) = 0.0003

The ranking score for 'de-i-teo' was then calculated by summing up NWF_BPS (data, 'de-i-teo')
for each search engine:

• RF_BPS (data, 'de-i-teo') = 0.5811 + 0.7957 + 0.3080 = 1.6848



Search Engine    c_1 = 'de-i-teo'      c_2 = 'de-i-ta'       c_3 = 'de-ta'
(e = data)       WF         NWF        WF         NWF        WF       NWF
A                94,100     0.5811     67,800     0.4186     54       0.0003
B                101,834    0.7957     26,132     0.2042     11       0.0001
C                1,358      0.3080     3,028      0.6868     23       0.0052
RF               1.6848                1.3096                0.0056

Table 16: Example transliteration ranking for data and its transliterations; WF, NWF,
and RF represent WF_BPS, NWF_BPS, and RF_BPS, respectively.
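The computation in Table 16 can be reproduced in a few lines. The sketch below applies the normalization and summation described for Formula (7) to the web frequencies listed in Table 16. The function name `ranking_scores` is introduced here for illustration, and skipping an engine whose total frequency is zero is an assumption of this sketch, since the paper does not discuss that case.

```python
# Web frequencies (WF) from Table 16: search engine -> transliteration -> count.
WEB_FREQ = {
    "A": {"de-i-teo": 94100, "de-i-ta": 67800, "de-ta": 54},
    "B": {"de-i-teo": 101834, "de-i-ta": 26132, "de-ta": 11},
    "C": {"de-i-teo": 1358, "de-i-ta": 3028, "de-ta": 23},
}


def ranking_scores(web_freq):
    """Sum, over search engines, each candidate's normalized web frequency."""
    scores = {}
    for counts in web_freq.values():
        total = sum(counts.values())
        if total == 0:
            continue  # assumption: an engine with no hits contributes nothing
        for cand, wf in counts.items():
            scores[cand] = scores.get(cand, 0.0) + wf / total
    return scores


scores = ranking_scores(WEB_FREQ)
ranked = sorted(scores, key=scores.get, reverse=True)
for cand in ranked:
    print(f"{cand}: {scores[cand]:.4f}")
```

Because each engine's frequencies are normalized before summing, an engine that indexes far more pages than the others does not dominate the score; each engine contributes at most 1 to any candidate.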



6.1 Evaluation 

We tested the performance of the transliteration ranking under two conditions: 1) with all
test data (ALL) and 2) with test data for which at least one transliteration model produced
the correct transliteration (CTC). Testing with ALL showed the overall performance of the
machine transliteration while testing with CTC showed the performance of the transliteration
ranking alone. We used the performance of the individual transliteration models
(ψ_G, ψ_P, ψ_H, and ψ_C) as the baseline. The results are shown in Table 17. "Top-n" means
that the correct transliteration was within the Top-n ranked transliterations. The average
number of produced Korean transliterations was 3.87 and that of Japanese ones was 4.50;
note that ψ_P and ψ_C produced more than one transliteration because of pronunciation
variations. The results for both English-to-Korean and English-to-Japanese transliteration
indicate that our ranking method effectively filters out noisy transliterations and positions
the correct transliterations in the top rank; most of the correct transliterations were in
Top-1. We see that transliteration ranking (in Top-1) significantly improved performance
of the individual models for both EKSet and EJSet¹⁶. The overall performance of the
transliteration (for ALL) as well as that of the ranking (for CTC) were relatively good.
Notably, the CTC performance showed that web data is a useful language resource for ranking
transliterations.

16. A one-tail paired t-test showed that the performance improvement was significant (level
of significance = 0.001).
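The difference between the ALL and CTC conditions can be made concrete with a short sketch of Top-n accuracy. The items in `toy_items` are hypothetical and serve only to show how a production error lowers the ALL score while leaving the CTC score unaffected; the function name and data layout are assumptions of this sketch.

```python
def top_n_accuracy(items, n, ctc_only=False):
    """Fraction of items whose gold transliteration is in the Top-n candidates.

    items: list of (ranked_candidates, gold_set) pairs.
    ctc_only: if True, keep only items where some candidate is correct (CTC).
    """
    if ctc_only:
        items = [(ranked, gold) for ranked, gold in items if gold & set(ranked)]
    if not items:
        return 0.0
    hits = sum(1 for ranked, gold in items if gold & set(ranked[:n]))
    return hits / len(items)


# Hypothetical ranked outputs paired with gold-standard transliterations.
toy_items = [
    (["a1", "a2", "a3"], {"a1"}),  # correct at rank 1
    (["b1", "b2"], {"b2"}),        # correct at rank 2 (a Top-1 ranking error)
    (["c1", "c2"], {"c9"}),        # production error: gold never produced
    (["d1"], {"d1"}),              # correct at rank 1
]

print("ALL Top-1:", top_n_accuracy(toy_items, 1))
print("CTC Top-1:", top_n_accuracy(toy_items, 1, ctc_only=True))
```

Under CTC the denominator excludes items with production errors, so the score isolates the quality of the ranking step.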



Test data              EKSet     EJSet
ALL       ψ_G          58.8%     58.8%
          ψ_P          55.2%     59.2%
          ψ_H          64.1%     65.8%
          ψ_C          65.5%     69.1%
ALL       Top-1        71.5%     74.8%
          Top-2        75.3%     76.9%
          Top-3        75.8%     77.0%
CTC       Top-1        94.3%     97.2%
          Top-2        99.2%     99.9%
          Top-3        100%      100%

Table 17: Results of transliteration ranking.



6.2 Analysis of Results 

We defined two error types: production errors and ranking errors. A production error 
is when there is no correct transliteration among the produced transliterations. A ranking 
error is when the correct transliteration does not appear in the Top-1 ranked transliterations. 

We examined the relationship between the search method and the transliteration rank- 
ing. Table 18 shows the ranking performance by each search method. The RTC represents 
correct transliterations ranked by each search method. The NTC represents test data 
ranked, that is, the coverage of each search method. The ratio of RTC to NTC represents 
the upper bound of performance and the difference between RTC and NTC is the number 
of errors. 
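As a small worked illustration of these definitions, the sketch below takes the RTC and NTC counts reported in Table 18 and derives, for each search method, the RTC/NTC ratio, the number of errors (NTC minus RTC), and the share of the test set that the method handled. The names `TABLE_18`, `TEST_SET_SIZE`, and `summarize` are introduced here for illustration.

```python
# (RTC, NTC) per test set and search method, copied from Table 18.
TABLE_18 = {
    ("EKSet", "BPS"): (4568, 5270),
    ("EKSet", "BKS"): (596, 1024),
    ("EKSet", "MKS"): (275, 878),
    ("EJSet", "BPS"): (7800, 8829),
    ("EJSet", "BKS"): (188, 820),
    ("EJSet", "MKS"): (33, 768),
}

# Test set sizes (|KTG| and |JTG|) as given in the captions of Figures 3 and 4.
TEST_SET_SIZE = {"EKSet": 7172, "EJSet": 10417}


def summarize(rtc, ntc, test_set_size):
    """Return (RTC/NTC ratio, error count, coverage of the test set)."""
    return rtc / ntc, ntc - rtc, ntc / test_set_size


for (dataset, method), (rtc, ntc) in TABLE_18.items():
    ratio, errors, coverage = summarize(rtc, ntc, TEST_SET_SIZE[dataset])
    print(f"{dataset} {method}: ratio={ratio:.3f} errors={errors} coverage={coverage:.3f}")
```

The NTC values of the three methods add up to the full test set size for each language pair, which is consistent with the methods being applied sequentially so that each test word is handled by exactly one of them.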

The best performance was with a BPS. A BPS handled 5,270 out of 7,172 cases for 
EKSet and 8,829 out of 10,417 cases for EJSet. That is, it did the best job of retrieving 
web documents containing transliteration pairs. Analysis of the ranking errors revealed 
that the main cause of such errors in a BPS was transliteration variations. These variations 
contribute to ranking errors in two ways. First, when the web frequencies of transliteration 
variations are higher than those of the standard ones, the variations are ranked higher than 
the standard ones, as shown by the examples in Table 19. Second, when the transliterations
include only transliteration variations (i.e., there are no correct transliterations), the
correct ranking cannot be achieved. In this case, ranking errors are caused by production
errors. With a BPS, there were 603 cases of this for EKSet and 895 cases for EJSet.

NTC was smaller with a BKS and an MKS because a BPS retrieves web documents
whenever possible. Table 18 shows that production errors are the main reason a BPS fails
to retrieve web documents. (When a BKS or MKS was used, production errors occurred in
all but 871¹⁷ cases for EKSet and 221¹⁸ cases for EJSet.) The results also show that a BKS
was more effective than an MKS.

          EKSet                         EJSet
          BPS       BKS       MKS       BPS       BKS       MKS
Top-1     83.8%     55.1%     16.7%     86.2%     19.0%     2.7%
Top-2     86.6%     58.4%     27.0%     88.3%     22.8%     4.2%
Top-3     86.6%     58.2%     31.3%     88.35%    22.9%     4.3%
RTC       4,568     596       275       7,800     188       33
NTC       5,270     1,024     878       8,829     820       768

Table 18: Ranking performance of each search method.

                        Transliteration        Web Frequency
compact → Korean        'kom-paek-teu'         1,075
                        'keom-paek-teu'*       1,793
pathos → Korean         'pa-to-seu'            1,615
                        'pae-to-seu'*          14,062
cohen → Japanese        'ko-o-he-n'            23
                        'ko-o-e-n'*            112
criteria → Japanese     'ku-ra-i-te-ri-a'      104
                        'ku-ri-te-ri-a'*       1,050

Table 19: Example ranking errors when a BPS was used (* indicates a variation).

The trade-off between the quality and coverage of retrieved web documents is an im- 
portant factor in transliteration ranking. A BPS provides better quality rather than wider 
coverage, but is effective since it provides reasonable coverage. 

7. Conclusion 

We tested and compared four transliteration models, grapheme-based transliteration
model (ψ_G), phoneme-based transliteration model (ψ_P), hybrid transliteration
model (ψ_H), and correspondence-based transliteration model (ψ_C), for English-to-Korean
and English-to-Japanese transliteration. We modeled a framework for the four
transliteration models and compared them within the framework. Using the results, we
examined a way to improve the performance of machine transliteration.

We found that ψ_H and ψ_C are more effective than ψ_G and ψ_P. The main reason
for the better performance of ψ_C is that it uses the correspondence between the source
grapheme and the source phoneme. The use of this correspondence positively affected
transliteration performance in various tests.



17. 596 (RTC of BKS for EKSet) + 275 (RTC of MKS for EKSet) = 871

18. 188 (RTC of BKS for EJSet) + 33 (RTC of MKS for EJSet) = 221

We demonstrated that ψ_G, ψ_P, ψ_H, and ψ_C can be used as complementary transliteration
models to improve the chances of producing correct transliterations. A combination of
the four models produced more correct transliterations both in English-to-Korean
transliteration and English-to-Japanese transliteration compared to each model alone. Given
these results, we described a way to improve machine transliteration that combines different
transliteration models: 1) produce a list of transliterations by combining transliterations
produced by multiple transliteration models; 2) rank the transliterations
on the basis of their relevance.

Testing showed that transliteration ranking based on web frequency is an effective way 
to calculate the relevance of transliterations. This is because web data reflects real-world 
usage, so it can be used to filter out noisy transliterations, which are not used as target 
language words or are incorrect transliterations for a source language word. 

There are several directions for future work. Although we considered some translit- 
eration variations, our test sets mainly covered standard transliterations. In corpora or 
web pages, however, we routinely find other types of transliterations - misspelled translit- 
erations, transliterations of common phrases, etc. - along with the standard translitera- 
tions and transliteration variations. Therefore, further testing using such transliterations 
is needed to enable the transliteration models to be compared more precisely. To achieve 
a machine transliteration system capable of higher performance, we need a more sophisti- 
cated transliteration method and a more sophisticated ranking algorithm. Though many 
correct transliterations can be acquired through the combination of the four transliteration 
models, there are still some transliterations that none of the models can produce. We need 
to devise a method that can produce them. Our transliteration ranking method works well, 
but, because it depends on web data, it faces limitations if the correct transliteration does 
not appear in web data. We need a complementary ranking method to handle such cases. 
Moreover, to demonstrate the effectiveness of these four transliteration models, we need to 
apply them to various natural language processing applications.